This is a demonstration of visualizing, clustering, and benchmarking data that includes a geographical dimension.
Even basic analysis of your data can produce novel insights for your business or customers. Where such insights are not already a baseline expectation, they could give you a competitive advantage or a new value proposition. In the financial and property industries, you might have a portfolio of physical assets such as buildings or houses. Grouping similar properties enables you to benchmark individual properties against their peers. A benchmark report gives owners, investors, and managers a tool for evaluating and planning the performance of their assets.
As a proxy for a ‘real’ portfolio dataset, this example uses data for mountain peaks located in Colorado, USA, that have an elevation greater than 14,000 ft (4,270 meters).
Presented first is a simple cluster analysis on the location using latitude and longitude. The model is then expanded to consider other factors like those describing the hiking experience.
- Load & Familiarize Data
- Pick factors and consider cluster sizes
- Cluster based on distance
- Add other factors to cluster analysis
- Visualize data on a map
- Use the clusters for benchmarking
The data included in this example is a tidy list of Colorado peaks with sixteen variables describing each peak. Here is a description of each variable:
- ID – A unique Identifier for each row
- Mountain Peak – The name of the peak
- Mountain Range – The name of the primary mountain range the peak is a member of
- Elevation_ft – The peak elevation in feet
- Fourteener – An indicator (Y or N) of whether the peak is considered a fourteener
- Prominence_ft – The topographic prominence in feet: how far the peak rises above the lowest saddle connecting it to higher terrain
- Isolation_mi – The distance in miles from the nearest point of the same or higher elevation
- Lat – The latitudinal coordinate in decimal form
- Long – The longitudinal coordinate in decimal form
- Standard Route – The name of the most commonly used hiking/climbing route to the peak
- Distance_mi – The distance of the standard route in miles
- Elevation Gain_ft – The elevation gain of the standard route in feet
- Difficulty – The Yosemite Decimal System difficulty rating, a value ranging from Class 1 (easiest) to Class 5 (most difficult)
- Traffic Low – The low range of estimated visits in the year 2017
- Traffic High – The high range of estimated visits in the year 2017
- Photo – A URL to a photo of the peak
Reference the code book for more details about these data elements.
The table below shows the first five rows of the dataset:
| Mountain Peak | Mountain Range | Elevation_ft | Prominence_ft | Isolation_mi | Lat | Long | Standard Route | Distance_mi | Elevation Gain_ft | Difficulty | Traffic Low | Traffic High |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mount Elbert | Sawatch Range | 14440 | 9093 | 670.00 | 39.1178 | -106.4454 | Northeast Ridge | 9.50 | 4700 | Class 1 | 20000 | 25000 |
| Mount Massive | Sawatch Range | 14428 | 1961 | 5.06 | 39.1875 | -106.4757 | East Slopes | 14.50 | 4500 | Class 2 | 7000 | 10000 |
| Mount Harvard | Sawatch Range | 14421 | 2360 | 14.93 | 38.9244 | -106.3207 | South Slopes | 14.00 | 4600 | Class 2 | 5000 | 7000 |
| Blanca Peak | Sangre de Cristo Range | 14351 | 5326 | 103.40 | 37.5775 | -105.4856 | Northwest Ridge | 17.00 | 6500 | Hard Class 2 | 1000 | 3000 |
| La Plata Peak | Sawatch Range | 14343 | 1836 | 6.28 | 39.0294 | -106.4729 | Northwest Ridge | 9.25 | 4500 | Class 2 | 5000 | 7000 |
Starting with peak locations, a quick look at the data gives an initial view as to the geographical spread.
The isolation histogram already hints at clustering in the data. Most peaks lie within 10 miles of a higher point. Another group seems to fall between 10 and 30 miles, and a handful are more isolated. Mt Elbert was excluded from this histogram because it is the highest peak in the dataset, and the nearest higher peak is Mount Whitney, 670 miles away. Within the scope of ‘Colorado Fourteeners’, this isolation figure is an outlier.
The prominence data indicate a somewhat more even distribution. Mt Elbert is once again a standout, but this time it is included because it is not treated as an outlier. Its prominence of 9,093 ft is nearly double that of any other peak in the data.
The boxplot with jittered points shows the distribution of peak elevations within their respective mountain ranges. The mountain range is a natural clustering mechanism. Studying the jittered points reveals possible sub-groupings within a range if elevation is an important factor in the data.
Clustering on Location
We will first conduct a hierarchical cluster analysis based on location in terms of latitude and longitude. We will then interpret the results by comparing how these clusters relate to the mountain ranges.
First, create a matrix that holds the distance between every pair of peaks. For illustrative purposes, the distance is calculated using the Haversine formula and presented in miles.
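The distance matrix step can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the original analysis: the function name `haversine_mi`, the dictionary layout, and the Earth radius of 3,958.8 miles are assumptions made for the example. The coordinates are those of the first five peaks in the dataset.

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles (assumed value for this sketch)

def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

# (Lat, Long) of the first five peaks in the dataset
peaks = {
    "Mount Elbert": (39.1178, -106.4454),
    "Mount Massive": (39.1875, -106.4757),
    "Mount Harvard": (38.9244, -106.3207),
    "Blanca Peak": (37.5775, -105.4856),
    "La Plata Peak": (39.0294, -106.4729),
}

# Full pairwise distance matrix as a nested dictionary: dist[a][b] in miles
names = list(peaks)
dist = {a: {b: haversine_mi(*peaks[a], *peaks[b]) for b in names} for a in names}

# Print one row per peak, rounded to one decimal place
for a in names:
    print(f"{a:15s}" + "".join(f"{dist[a][b]:8.1f}" for b in names))
```

Small differences in the last decimal place are possible depending on the Earth radius used.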
This distance matrix is too big to display in its entirety, but below is an example of the first five mountains. You can see the distance between each in miles (as the crow flies).
| | Mount Elbert | Mount Massive | Mount Harvard | Blanca Peak | La Plata Peak |
|---|---|---|---|---|---|
| La Plata Peak | 6.3 | 10.9 | 10.9 | 113.8 | 0.0 |
The first five peaks are plotted on a map…
You can see, even in just the first five data elements, that two or three groups appear depending on how ‘deep’ you look.
Next, we feed the distance matrix into a hierarchical cluster algorithm. You can visualize the results of how the peaks can be grouped with a dendrogram.
The vertical axis lists every mountain peak and the horizontal axis measures the degree of distance between groups. The horizontal axis ranges from zero to just over 120. At zero, every mountain is its own cluster, so there are five clusters. At 120, the data falls into two clusters, and at about 20 there are three clusters.
Here is the dendrogram for all peaks.
You can use the dendrogram to determine how much distance there is between groups, or how many groups your data falls into. Picking a level to cluster on is called cutting the tree.
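To make the merge heights concrete, here is a dependency-free sketch of agglomerative clustering on the first five peaks. It assumes complete linkage (the distance between two clusters is their largest pairwise distance); the linkage method behind the dendrogram is not stated in this article, so treat that choice, along with the helper names `agglomerate` and `cut`, as assumptions. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would normally do this work.

```python
import math

def haversine_mi(p, q, radius_mi=3958.8):
    """Great-circle distance in miles between two (lat, long) pairs in degrees."""
    (lat1, lon1), (lat2, lon2) = p, q
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * radius_mi * math.asin(math.sqrt(a))

def agglomerate(points, dist_fn):
    """Complete-linkage clustering; returns merges as (height, cluster_a, cluster_b)."""
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance
                h = max(dist_fn(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or h < best[0]:
                    best = (h, i, j)
        h, i, j = best
        merges.append((h, clusters[i], clusters[j]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

def cut(points, merges, height):
    """Clusters that remain when the tree is cut at the given height."""
    clusters = {frozenset([name]) for name in points}
    for h, a, b in merges:
        if h <= height:
            clusters -= {a, b}
            clusters.add(a | b)
    return clusters

peaks = {
    "Mount Elbert": (39.1178, -106.4454),
    "Mount Massive": (39.1875, -106.4757),
    "Mount Harvard": (38.9244, -106.3207),
    "Blanca Peak": (37.5775, -105.4856),
    "La Plata Peak": (39.0294, -106.4729),
}

merges = agglomerate(peaks, haversine_mi)
for h, a, b in merges:
    print(f"{h:6.1f}  {sorted(a)} + {sorted(b)}")

# Cutting at different heights yields different numbers of clusters
for height in (0, 15, 120):
    print(height, len(cut(peaks, merges, height)))
```

Each merge height corresponds to a join in the dendrogram, so cutting at a lower height leaves more, smaller clusters.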
Because the peaks are attributed to one of six mountain ranges, we will cut the tree at a height that gives six groups. The table below compares the hierarchical groups with the mountain ranges.
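| Mountain Range | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Group 6 |
|---|---|---|---|---|---|---|
| San Juan Mountains | 0 | 0 | 12 | 0 | 0 | 0 |
| Sangre de Cristo Range | 0 | 9 | 0 | 0 | 0 | 1 |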
The groupings based on distance are similar to how the peaks are organized into mountain ranges. Group 1 includes peaks from three different mountain ranges. Group 3 clusters all peaks from the San Juan Mountains, and Group 2 captures nine of the ten peaks in the Sangre de Cristo Range.
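A comparison table like the one above is a cross-tabulation of mountain range against cluster label. The sketch below shows the idea on the first five peaks only. The group labels are hypothetical, chosen to illustrate a two-group cut in which Blanca Peak stands apart; they are not output from the full analysis.

```python
from collections import Counter

# Mountain range of each of the first five peaks in the dataset
ranges = {
    "Mount Elbert": "Sawatch Range",
    "Mount Massive": "Sawatch Range",
    "Mount Harvard": "Sawatch Range",
    "Blanca Peak": "Sangre de Cristo Range",
    "La Plata Peak": "Sawatch Range",
}

# Hypothetical cluster labels for illustration (e.g. a tree cut into two groups)
groups = {
    "Mount Elbert": 1,
    "Mount Massive": 1,
    "Mount Harvard": 1,
    "Blanca Peak": 2,
    "La Plata Peak": 1,
}

# Count how many peaks fall into each (range, group) cell
counts = Counter((ranges[p], groups[p]) for p in ranges)

range_names = sorted(set(ranges.values()))
group_ids = sorted(set(groups.values()))

print("Mountain Range".ljust(24) + "".join(f"{g:>4}" for g in group_ids))
for r in range_names:
    print(r.ljust(24) + "".join(f"{counts[(r, g)]:>4}" for g in group_ids))
```

A range whose peaks all land in one column is reproduced cleanly by the clustering; a range spread across several columns is being split by distance.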
The map below shows the peaks where the color represents each group and the labels show the mountain range. Mouse over to see the labels.
Clustering on More Dimensions
That is a simple example of clustering peaks by location. But maybe we want to find groups of similar peaks by including more factors. Perhaps we want to group by a hiker’s experience, for example.
The data also includes variables describing the hiking difficulty (on the standard route):
- Distance_mi – The distance in miles for the standard route
- Elevation.Gain_ft – The elevation gain along the standard route
- Difficulty_Rating – A numeric representation of the hike’s class
The table below shows the first five rows:
| Mountain Peak | Distance_mi | Elevation.Gain_ft | Difficulty_Rating |
|---|---|---|---|
| La Plata Peak | 9.25 | 4500 | 2.0 |
We will run another cluster analysis, but this time including the hiking difficulty variables along with the location (lat & long). Because the factors are measured using different scales, the data will be normalized before calculating distances.
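Normalizing matters because elevation gain is measured in thousands of feet while the difficulty rating spans only a few units, so unscaled distances would be dominated by elevation gain. The sketch below uses z-score standardization (subtract the column mean, divide by the sample standard deviation). That method is an assumption, since the article does not say which normalization was applied. The numeric ratings for classes other than La Plata Peak's 2.0, including 2.5 for "Hard Class 2", are also assumed for illustration.

```python
import math
import statistics

# Columns: Distance_mi, Elevation.Gain_ft, Difficulty_Rating, Lat, Long
features = {
    "Mount Elbert":  (9.50, 4700, 1.0, 39.1178, -106.4454),   # Class 1 -> 1.0 (assumed)
    "Mount Massive": (14.50, 4500, 2.0, 39.1875, -106.4757),
    "Mount Harvard": (14.00, 4600, 2.0, 38.9244, -106.3207),
    "Blanca Peak":   (17.00, 6500, 2.5, 37.5775, -105.4856),  # Hard Class 2 -> 2.5 (assumed)
    "La Plata Peak": (9.25, 4500, 2.0, 39.0294, -106.4729),
}

# Column-wise mean and sample standard deviation
columns = list(zip(*features.values()))
means = [statistics.mean(c) for c in columns]
stdevs = [statistics.stdev(c) for c in columns]

# z-score every value so each column has mean 0 and standard deviation 1
scaled = {
    name: tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
    for name, row in features.items()
}

def euclidean(a, b):
    """Euclidean distance between two standardized feature vectors."""
    return math.dist(a, b)

names = list(scaled)
for a in names:
    print(f"{a:15s}" + "".join(f"{euclidean(scaled[a], scaled[b]):7.2f}" for b in names))
```

The resulting matrix can be fed to the same hierarchical clustering step used for the location-only analysis.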
Based on what we know about the data and reviewing the dendrogram, we will cut the tree to produce 8 groups.
The map and boxplot below show how the clustering algorithm grouped similar mountains based on hiker experience factors.
For example, you can see that Group 4 includes the easiest mountains, which also happen to be more accessible to people living in Denver and the Front Range. Group 5 includes the most technically challenging peaks.
If you have a large dataset, especially one with a lot of dimensions, clustering is a great step toward pulling out relevant and useful insights. Within the property management context, for example, it can show building managers how their performance compares with that of similar facilities. It can also show investors revenue and cost benchmarks for a particular area.
For fourteener hikers, the groupings can narrow the 53 peaks down to a handful that match their interests.
For example, say there is someone who hiked Snowmass Mountain and loved the technical challenge.
The person’s interests might best relate to Group 5. The radar charts show Snowmass and the other two peaks in Group 5 that are in the Elk Mountains.
The table and map below show all the mountains in Group 5:
| ID | Mountain Peak | Mountain Range | Standard Route | Distance_mi | Elevation Gain_ft | Difficulty | Group |
|---|---|---|---|---|---|---|---|
| 16 | Mount Wilson | San Juan Mountains | North Slopes | 16.00 | 5100 | Class 4 | 5 |
| 29 | Capitol Peak | Elk Mountains | Northeast Ridge | 17.00 | 5300 | Class 4 | 5 |
| 31 | Snowmass Mountain | Elk Mountains | East Slopes | 22.00 | 5800 | Hard Class 3 | 5 |
| 39 | Sunlight Peak | San Juan Mountains | South Face | 17.00 | 6000 | Class 4 | 5 |
| 47 | Pyramid Peak | Elk Mountains | Northeast Ridge | 8.25 | 4500 | Class 4 | 5 |
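Once peaks are grouped, benchmarking one peak against its peers is a short calculation. The sketch below compares Snowmass Mountain with the mean of the other Group 5 peaks listed above, using only the route distance and elevation gain columns. The `benchmark` helper and its output layout are hypothetical, intended only to show the idea.

```python
import statistics

METRICS = ("Distance_mi", "Elevation Gain_ft")

# Group 5 rows from the table above: (Distance_mi, Elevation Gain_ft)
group5 = {
    "Mount Wilson": (16.00, 5100),
    "Capitol Peak": (17.00, 5300),
    "Snowmass Mountain": (22.00, 5800),
    "Sunlight Peak": (17.00, 6000),
    "Pyramid Peak": (8.25, 4500),
}

def benchmark(peak, group):
    """Compare one peak's metrics with the mean of the other peaks in its group."""
    peers = [values for name, values in group.items() if name != peak]
    report = {}
    for k, metric in enumerate(METRICS):
        peer_mean = statistics.mean(p[k] for p in peers)
        report[metric] = {
            "value": group[peak][k],
            "peer_mean": peer_mean,
            "difference": group[peak][k] - peer_mean,
        }
    return report

report = benchmark("Snowmass Mountain", group5)
for metric, row in report.items():
    print(f"{metric:18s} value={row['value']:>8}  "
          f"peer mean={row['peer_mean']:>9.2f}  diff={row['difference']:>+8.2f}")
```

The same pattern applies to a property portfolio: replace the peaks with buildings and the route metrics with revenue or cost figures.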