Introduction

This is a demonstration of visualizing, clustering, and benchmarking data that includes a geographical dimension.

Applying basic analysis to your data can produce novel insights for your business or customers. Where such insight is not already a baseline expectation, it could give you a competitive advantage or a new value proposition. In the financial and property industries, you might have a portfolio of physical assets such as buildings or houses. Grouping similar properties enables you to benchmark each one against its peers. A benchmark report gives owners, investors, and managers a tool for evaluating and planning the performance of their assets.

As a proxy for a ‘real’ portfolio dataset, this example uses data for mountain peaks located in Colorado, USA, that have an elevation greater than 14,000 ft (about 4,270 meters).

Presented first is a simple cluster analysis on the location using latitude and longitude. The model is then expanded to consider other factors like those describing the hiking experience.

The steps

  1. Load & Familiarize Data
  2. Pick factors and consider cluster sizes
  3. Cluster based on distance
  4. Add other factors to cluster analysis
  5. Visualize data on a map
  6. Use the clusters for benchmarking

Data

The data included in this example is a tidy list of Colorado peaks with sixteen variables describing each peak. Here is a description of each variable:

  • ID – A unique Identifier for each row
  • Mountain Peak – The name of the peak
  • Mountain Range – The name of the primary mountain range the peak is a member of
  • Elevation_ft – The peak elevation in feet
  • Fourteener – An indicator if the peak is considered a fourteener and includes a value of Y or N
  • Prominence_ft – How far the summit rises in feet above the lowest saddle (key col) connecting it to any higher peak
  • Isolation_mi – The distance in miles from the nearest point of the same or higher elevation
  • Lat – The latitudinal coordinate in decimal form
  • Long – The longitudinal coordinate in decimal form
  • Standard Route – The name of the most commonly used hiking/climbing route to the peak
  • Distance_mi – The distance of the standard route in miles
  • Elevation Gain_ft – The elevation gain of the standard route in feet
  • Difficulty – The Yosemite Decimal System difficulty rating, a value ranging from Class 1 (easiest) to Class 5 (most difficult)
  • Traffic Low – The low range of estimated visits in the year 2017
  • Traffic High – The high range of estimated visits in the year 2017
  • Photo – A URL to a photo of the peak

Reference the code book for more details about these data elements.

The table below shows the first five rows of the dataset.

| Mountain.Peak | Mountain.Range | Elevation_ft | Prominence_ft | Isolation_mi | Lat | Long | Standard.Route | Distance_mi | Elevation.Gain_ft | Difficulty | Traffic.Low | Traffic.High |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mount Elbert | Sawatch Range | 14440 | 9093 | 670.00 | 39.1178 | -106.4454 | Northeast Ridge | 9.50 | 4700 | Class 1 | 20000 | 25000 |
| Mount Massive | Sawatch Range | 14428 | 1961 | 5.06 | 39.1875 | -106.4757 | East Slopes | 14.50 | 4500 | Class 2 | 7000 | 10000 |
| Mount Harvard | Sawatch Range | 14421 | 2360 | 14.93 | 38.9244 | -106.3207 | South Slopes | 14.00 | 4600 | Class 2 | 5000 | 7000 |
| Blanca Peak | Sangre de Cristo Range | 14351 | 5326 | 103.40 | 37.5775 | -105.4856 | Northwest Ridge | 17.00 | 6500 | Hard Class 2 | 1000 | 3000 |
| La Plata Peak | Sawatch Range | 14343 | 1836 | 6.28 | 39.0294 | -106.4729 | Northwest Ridge | 9.25 | 4500 | Class 2 | 5000 | 7000 |

Exploratory Analysis

Starting with peak locations, a quick look at the data gives an initial view as to the geographical spread.

The isolation histogram already hints at clustering in the data. Most peaks have an isolation of less than 10 miles. Another group seems to fall between 10 and 30 miles, and a handful are more isolated. Mt Elbert was excluded from this histogram because it is the highest peak in the dataset and the nearest higher peak, Mount Whitney, is 670 miles away. Within the scope of ‘Colorado Fourteeners’, this isolation figure is an outlier.
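The grouping described above can be sketched as a simple binning step. This is only an illustration on the five sample rows from the table earlier in this post. The bin edges (10 and 30 miles) are taken from the prose, and the function name `isolation_band` is hypothetical, not part of the original analysis.

```python
# Isolation in miles for the five sample peaks shown earlier.
isolation_mi = {
    "Mount Elbert": 670.00,
    "Mount Massive": 5.06,
    "Mount Harvard": 14.93,
    "Blanca Peak": 103.40,
    "La Plata Peak": 6.28,
}


def isolation_band(miles):
    """Assign an isolation value to one of three bands (edges are an assumption)."""
    if miles < 10:
        return "under 10 mi"
    if miles <= 30:
        return "10 to 30 mi"
    return "over 30 mi"


# Drop Mount Elbert, whose 670-mile figure points to a peak outside Colorado.
kept = {name: mi for name, mi in isolation_mi.items() if name != "Mount Elbert"}

# Count how many of the remaining peaks fall into each band.
band_counts = {}
for miles in kept.values():
    band = isolation_band(miles)
    band_counts[band] = band_counts.get(band, 0) + 1

print(band_counts)
```

With only four peaks left the counts say little on their own; the same function applied to the full dataset would give the three groups the histogram suggests.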

The prominence data are more evenly distributed. Mt Elbert is once again a standout: its prominence of 9,093 ft is nearly double that of any other peak in the data. This time it is kept in the chart because the value is not treated as an outlier.

The boxplot with jittered points shows the distribution of peak elevations within their respective mountain ranges. The mountain range is a natural clustering mechanism. Studying the jittered points reveals possible sub-groupings within a range if elevation is an important factor in the data.

Clustering on Location

We will first conduct a hierarchical cluster analysis based on location, using latitude and longitude. We then interpret the results by comparing the clusters with the mountain ranges.

First, create a matrix that holds the distance between every pair of peaks. For illustrative purposes, the distance is calculated using the Haversine formula and presented in miles.

This distance matrix is too big to display in its entirety, but below is an example of the first five mountains. You can see the distance between each in miles (as the crow flies).

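Distance in miles:

| | Mount Elbert | Mount Massive | Mount Harvard | Blanca Peak | La Plata Peak |
|---|---|---|---|---|---|
| Mount Elbert | 0.0 | | | | |
| Mount Massive | 5.1 | 0.0 | | | |
| Mount Harvard | 15.0 | 20.0 | 0.0 | | |
| Blanca Peak | 118.6 | 123.6 | 103.6 | 0.0 | |
| La Plata Peak | 6.3 | 10.9 | 10.9 | 113.8 | 0.0 |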
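The distances in the matrix above can be approximately reproduced with a short Haversine implementation. The sketch below is not the original code. It assumes a mean Earth radius of 3,958.8 miles, and the names `haversine_mi`, `peaks`, and `dist` are illustrative. The coordinates come from the sample rows shown earlier.

```python
import math

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles (assumed value)


def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))


# (Lat, Long) for the first five peaks in the dataset.
peaks = {
    "Mount Elbert": (39.1178, -106.4454),
    "Mount Massive": (39.1875, -106.4757),
    "Mount Harvard": (38.9244, -106.3207),
    "Blanca Peak": (37.5775, -105.4856),
    "La Plata Peak": (39.0294, -106.4729),
}

# Full symmetric distance matrix stored as a dict of dicts.
names = list(peaks)
dist = {a: {b: haversine_mi(*peaks[a], *peaks[b]) for b in names} for a in names}

for a in names:
    print(a.ljust(14), " ".join(f"{dist[a][b]:6.1f}" for b in names))
```

Small differences from the published figures are expected, because the result depends on the Earth radius chosen.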

The first five peaks are plotted on a map…

Even among the first five peaks, two or three groups appear, depending on how ‘deep’ you look.

Next, we feed the distance matrix into a hierarchical cluster algorithm. You can visualize the results of how the peaks can be grouped with a dendrogram.

The vertical axis lists every mountain peak and the horizontal axis measures the distance between groups. The horizontal axis ranges from zero to more than 120. At zero, every mountain is its own cluster, which gives five clusters for these five peaks. At 120, the data falls into two clusters, and at about 20 there are three clusters.

Here is the dendrogram for all peaks.

You can use the dendrogram to determine how much distance there is between groups, or how many groups your data falls into. Picking a level to cluster on is called cutting the tree.
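To make the idea of cutting the tree concrete, here is a minimal pure-Python sketch of agglomerative clustering on the five-peak distance matrix shown earlier. It assumes complete linkage (the distance between two clusters is the distance between their farthest members); this post does not state which linkage was used. The function names `linkage_height` and `cut_tree` are hypothetical.

```python
# Pairwise distances in miles, copied from the five-peak matrix shown earlier.
names = ["Mount Elbert", "Mount Massive", "Mount Harvard", "Blanca Peak", "La Plata Peak"]
pair_mi = {
    ("Mount Elbert", "Mount Massive"): 5.1,
    ("Mount Elbert", "Mount Harvard"): 15.0,
    ("Mount Massive", "Mount Harvard"): 20.0,
    ("Mount Elbert", "Blanca Peak"): 118.6,
    ("Mount Massive", "Blanca Peak"): 123.6,
    ("Mount Harvard", "Blanca Peak"): 103.6,
    ("Mount Elbert", "La Plata Peak"): 6.3,
    ("Mount Massive", "La Plata Peak"): 10.9,
    ("Mount Harvard", "La Plata Peak"): 10.9,
    ("Blanca Peak", "La Plata Peak"): 113.8,
}


def d(a, b):
    """Look up the distance between two distinct peaks in either order."""
    return pair_mi.get((a, b), pair_mi.get((b, a)))


def linkage_height(c1, c2):
    """Complete linkage (an assumption): distance between the farthest members."""
    return max(d(a, b) for a in c1 for b in c2)


def cut_tree(items, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    Returns the remaining clusters and the height of each merge, in order.
    """
    clusters = [frozenset([x]) for x in items]
    heights = []
    while len(clusters) > k:
        h, i, j = min(
            (linkage_height(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        heights.append(h)
    return clusters, heights


two_groups, merge_heights = cut_tree(names, 2)
print(merge_heights)
for group in two_groups:
    print(sorted(group))
```

Cutting to two groups separates Blanca Peak from the four Sawatch Range peaks. With these rounded values two candidate merges tie at 10.9 miles, so a three-group cut depends on how ties are broken, while the two-group cut does not.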

Because the peaks are attributed to one of six mountain ranges, we will cut the tree at a height that gives six groups. The table below compares the hierarchical groups with the mountain ranges.

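| Mountain range | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Group 6 |
|---|---|---|---|---|---|---|
| Elk Mountains | 5 | 0 | 0 | 0 | 0 | 0 |
| Front Range | 0 | 0 | 0 | 5 | 1 | 0 |
| Mosquito Range | 5 | 0 | 0 | 0 | 0 | 0 |
| San Juan Mountains | 0 | 0 | 12 | 0 | 0 | 0 |
| Sangre de Cristo Range | 0 | 9 | 0 | 0 | 0 | 1 |
| Sawatch Range | 15 | 0 | 0 | 0 | 0 | 0 |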

The groupings based on distance are similar to the mountain-range assignments. Group 1 combines peaks from three ranges (Elk Mountains, Mosquito Range, and Sawatch Range). Group 3 contains all 12 San Juan peaks, and Group 2 captures nine of the ten Sangre de Cristo peaks.
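The comparison above can be checked mechanically from the cross-tabulation. The following sketch copies the counts from the table and derives which ranges feed each group and how large each group is. The variable names are illustrative only.

```python
# Counts copied from the table above: one row per range, one column per group 1-6.
crosstab = {
    "Elk Mountains": [5, 0, 0, 0, 0, 0],
    "Front Range": [0, 0, 0, 5, 1, 0],
    "Mosquito Range": [5, 0, 0, 0, 0, 0],
    "San Juan Mountains": [0, 0, 12, 0, 0, 0],
    "Sangre de Cristo Range": [0, 9, 0, 0, 0, 1],
    "Sawatch Range": [15, 0, 0, 0, 0, 0],
}

# For each group, list the ranges that contribute at least one peak.
ranges_by_group = {
    g + 1: [rng for rng, row in crosstab.items() if row[g] > 0]
    for g in range(6)
}

# Number of peaks in each group (column sums).
group_sizes = {g + 1: sum(row[g] for row in crosstab.values()) for g in range(6)}

for g in ranges_by_group:
    print(g, group_sizes[g], ranges_by_group[g])
```

Groups 5 and 6 each hold a single peak, split off from the Front Range and the Sangre de Cristo Range respectively.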

The map below shows the peaks where the color represents each group and the labels show the mountain range. Mouse over to see the labels.

Clustering on More Dimensions

That is a simple example of clustering peaks by location. But maybe we want to find groups of similar peaks using more factors. Perhaps we want to group by the hiker's experience, for example.

The data also includes variables describing the hiking difficulty on the standard route:

  • Distance_mi – The distance in miles for the standard route
  • Elevation.Gain_ft – The elevation gain along the standard route
  • Difficulty_Rating – A numeric representation of the hike’s class

The table below shows the first five rows.

| Mountain.Peak | Distance_mi | Elevation.Gain_ft | Difficulty_Rating |
|---|---|---|---|
| Mount Elbert | 9.50 | 4700 | 1.0 |
| Mount Massive | 14.50 | 4500 | 2.0 |
| Mount Harvard | 14.00 | 4600 | 2.0 |
| Blanca Peak | 17.00 | 6500 | 2.5 |
| La Plata Peak | 9.25 | 4500 | 2.0 |
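The Difficulty_Rating values above are consistent with a mapping in which the class number becomes the rating and the qualifier "Hard" adds 0.5 (Blanca Peak, "Hard Class 2", is shown as 2.5). The sketch below implements that reading. It is an inference from the sample rows, not the original conversion, and the function name `difficulty_rating` is hypothetical. Labels with any other qualifier are rejected rather than guessed.

```python
import re

# "Hard" = +0.5 is inferred from the sample rows; other qualifiers are not assumed.
QUALIFIER_OFFSET = {"": 0.0, "hard": 0.5}

LABEL_PATTERN = re.compile(r"\s*(?:(\w+)\s+)?Class\s+(\d)\s*", flags=re.IGNORECASE)


def difficulty_rating(label):
    """Convert a label such as 'Hard Class 2' into a number such as 2.5."""
    match = LABEL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized difficulty label: {label!r}")
    qualifier = (match.group(1) or "").lower()
    if qualifier not in QUALIFIER_OFFSET:
        raise ValueError(f"unknown qualifier in label: {label!r}")
    return int(match.group(2)) + QUALIFIER_OFFSET[qualifier]


for label in ["Class 1", "Class 2", "Hard Class 2", "Hard Class 3", "Class 4"]:
    print(label, "->", difficulty_rating(label))
```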

We will run another cluster analysis, but this time including the hiking difficulty variables along with the location (lat & long). Because the factors are measured using different scales, the data will be normalized before calculating distances.
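One common way to normalize is to convert each column to z-scores (subtract the column mean, divide by the column standard deviation) and then measure Euclidean distance between rows. The sketch below applies this to the five sample peaks. The choice of z-scores and Euclidean distance is an assumption, since the method is not named here, and `zscore_columns` is a hypothetical helper.

```python
import math
import statistics

# Columns: Lat, Long, Distance_mi, Elevation.Gain_ft, Difficulty_Rating
rows = {
    "Mount Elbert": [39.1178, -106.4454, 9.50, 4700, 1.0],
    "Mount Massive": [39.1875, -106.4757, 14.50, 4500, 2.0],
    "Mount Harvard": [38.9244, -106.3207, 14.00, 4600, 2.0],
    "Blanca Peak": [37.5775, -105.4856, 17.00, 6500, 2.5],
    "La Plata Peak": [39.0294, -106.4729, 9.25, 4500, 2.0],
}


def zscore_columns(table):
    """Rescale every column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*table.values()))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return {
        name: [(v - m) / s for v, m, s in zip(values, means, sds)]
        for name, values in table.items()
    }


scaled = zscore_columns(rows)

# Euclidean distance between two peaks in the rescaled feature space.
near = math.dist(scaled["Mount Massive"], scaled["Mount Harvard"])
far = math.dist(scaled["Mount Massive"], scaled["Blanca Peak"])
print(round(near, 2), round(far, 2))
```

Without rescaling, elevation gain (thousands of feet) would dominate the distance and the latitude and longitude differences (fractions of a degree) would barely register.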

Based on what we know about the data and reviewing the dendrogram, we will cut the tree to produce 8 groups.

| Group | Members |
|---|---|
| 1 | 16 |
| 2 | 6 |
| 3 | 6 |
| 4 | 11 |
| 5 | 5 |
| 6 | 4 |
| 7 | 1 |
| 8 | 4 |

The map and boxplot below show how the clustering algorithm grouped similar mountains based on hiker experience factors.

For example, you can see that Group 4 includes the easiest mountains, which also happen to be more accessible to people living in Denver and the Front Range. Group 5 includes the most technically challenging peaks.

Benchmarking

If you have a large dataset, especially one with many dimensions, clustering is a useful step toward pulling out relevant insights. Within the property management context, for example, it can show building managers how their performance compares with that of similar facilities. It can also show investors revenue and cost benchmarks for a particular area.

For fourteener hikers, the groupings can narrow the 53 peaks down to a handful that match their interests.

For example, say there is someone who hiked Snowmass Mountain and loved the technical challenge.

The person’s interests might best relate to Group 5. The radar charts show Snowmass Mountain and the other two Group 5 peaks in the Elk Mountains.

The table and map below show all the mountains in Group 5.

| Row | Mountain.Peak | Mountain.Range | Standard.Route | Distance_mi | Elevation.Gain_ft | Difficulty | hikeGroup |
|---|---|---|---|---|---|---|---|
| 16 | Mount Wilson | San Juan Mountains | North Slopes | 16.00 | 5100 | Class 4 | 5 |
| 29 | Capitol Peak | Elk Mountains | Northeast Ridge | 17.00 | 5300 | Class 4 | 5 |
| 31 | Snowmass Mountain | Elk Mountains | East Slopes | 22.00 | 5800 | Hard Class 3 | 5 |
| 39 | Sunlight Peak | San Juan Mountains | South Face | 17.00 | 6000 | Class 4 | 5 |
| 47 | Pyramid Peak | Elk Mountains | Northeast Ridge | 8.25 | 4500 | Class 4 | 5 |
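A benchmark for a single peak can be as simple as comparing its values with the median of the other members of its group. The sketch below does this for Snowmass Mountain using the Group 5 figures from the table above. Using the median of peers as the benchmark is an assumption made for illustration, and the `benchmark` function is hypothetical.

```python
import statistics

# Route figures for the five Group 5 peaks, copied from the table above.
group5 = {
    "Mount Wilson": {"Distance_mi": 16.00, "Elevation.Gain_ft": 5100},
    "Capitol Peak": {"Distance_mi": 17.00, "Elevation.Gain_ft": 5300},
    "Snowmass Mountain": {"Distance_mi": 22.00, "Elevation.Gain_ft": 5800},
    "Sunlight Peak": {"Distance_mi": 17.00, "Elevation.Gain_ft": 6000},
    "Pyramid Peak": {"Distance_mi": 8.25, "Elevation.Gain_ft": 4500},
}


def benchmark(peak, group):
    """Compare one peak with the median of the other members of its group."""
    peers = [values for name, values in group.items() if name != peak]
    report = {}
    for metric, value in group[peak].items():
        peer_median = statistics.median(p[metric] for p in peers)
        report[metric] = {
            "value": value,
            "peer_median": peer_median,
            "difference": value - peer_median,
        }
    return report


report = benchmark("Snowmass Mountain", group5)
for metric, stats in report.items():
    print(metric, stats)
```

On these figures, the Snowmass Mountain standard route is 5.5 miles longer and gains 600 ft more than the median of its four Group 5 peers.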
