author ：Pulkit Sharma translate ： Chen Chao proofreading ： Wu Zhendong

this paper ** about 4700 word **, Recommended reading

**15**

**minute**This paper starts from the comparison between unsupervised learning and supervised learning , Combined with specific cases to introduce the concept of hierarchical clustering 、 Application scenarios 、 Main types and Python Realization .

** introduction **

Understanding customer behavior is essential in any industry , I didn't realize it until last year . At that time, my CMO（chief marketing officer, Chief Marketing Officer ） Ask me ：“ Can you tell me , What group of users should our new product target ？”

It's a learning process for me . I soon realized , As a data scientist , How important is it to segment customers so that companies can customize their customers and establish target strategies . This is where the clustering concept comes in handy ！

User classification is often tricky , Because we don't have any target variables in mind . We are now officially entering the field of unsupervised learning , Discover patterns and structures without any set results . It's challenging for data scientists, but it's exciting .

There are several different clustering methods here （ You'll see in the following section ）. I'll introduce you to one of them —— Hierarchical clustering .

We're going to learn what hierarchical clustering is , It is superior to other clustering algorithms , Different levels of clustering methods and development steps . In the end, we will adopt a customer classification database and implement Python Hierarchical clustering . I love this method and I'm sure you'll like it after reading this article ！

notes ： As mentioned above , There are many clustering methods . I encourage you to look at our guidelines for different types of clustering ：

**An Introduction to Clustering and different methods of clustering**

https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/utm_source=blog&utm_medium=beginners-guide-hierarchical-clustering

Want to learn more about clustering and other machine learning algorithms （ Supervised and unsupervised ） Take a look at the following project -

https://courses.analyticsvidhya.com/bundles/certified-ai-ml-blackbelt-plus?utm_source=blog&utm_medium=beginners-guide-hierarchical-clustering

** Catalog **

1. supervise vs Unsupervised learning

2. Why use hierarchical clustering ？

3. What is hierarchical clustering ？

4. The types of hierarchical clustering

（1） Polymerization （Agglomerative） Hierarchical clustering

（2） Split （Divisive） Hierarchical clustering

5. The steps of hierarchical clustering

6. How to select the number of classes in hierarchical clustering ？

7. Using hierarchical clustering to solve a wholesale customer classification problem

** supervise vs Unsupervised learning **

Before we go into hierarchical clustering , It is important to understand the difference between supervised learning and unsupervised learning . Let me explain the difference with a simple example .

Suppose I want to estimate the number of bikes that will be rented each day ：

perhaps , We want to predict whether a person will survive on the Titanic ：

In both cases, there is a fixed goal to achieve ：

In the first example , Based on seasons 、 During the holiday 、 Working day 、 The weather 、 Temperature and other characteristics to predict the number of bicycle rental .

In the second example, predict whether the passenger will survive . stay “ To survive ” variable ,0 It means that the person is not alive ,1 It means that this person survived . The independent variables here include cabin class 、 Gender 、 Age 、 Ticket prices and so on .

** So , When we have a target variable （ In both cases, the number and survival ）, Based on a series of predictors or independent variables （ season , During the holiday , Gender , Age etc. ） To predict the , This problem is called supervised learning .**

Let's take a look at the figure below to better understand it ：

ad locum ,y It's a dependent variable or a target variable ,X For the independent variable . The target variable depends on X, So it's also called a dependent variable . We use independent variables to train the model under the supervision of the target variable , So it's called supervised learning .

Our goal in training models is to generate a function , Ability to map arguments to desired goals . Once the model training is complete , We can put new observations in , The model can predict the target itself . To make a long story short , This process is called supervised learning .

Sometimes we don't have any target variables to predict . There is no explicit target variable for this problem , It's called unsupervised learning . We have only independent variables .

We try to divide the whole data into a series of groups . These groups are called clusters , This process is called clustering .

This technique is often used to cluster populations into different groups . Common examples include customer clustering 、 Clustering similar files 、 Recommend similar songs or movies, etc .

Now there are many algorithms that can help us to cluster . The most commonly used clustering algorithm is K-means And hierarchical clustering .

** Why hierarchical clustering ？**

Before that , We need to know first K-means How it works . believe me , This will make the concept of hierarchical clustering easier .

There's one right here K-means An overview of how algorithms work ：

1． Determine the number of clusters （k）

2． choice k Random points as the center

3． Put all the points in the nearest Center

4． Calculate the center of the new cluster

5． Repeat step 3 and 4

This is an iterative process . It will continue to run , Until the center point of the newly formed cluster does not change , Or the maximum number of iterations .

however K-means There have also been some doubts . It usually tries to generate clusters of the same specification . also , We need to determine the number of clusters before the algorithm starts . Ideally , We don't know how many clusters we need at the beginning of the algorithm , So this is also K-means A kind of question faced by .

This is exactly the advantage of hierarchical clustering . It solves the problem of setting the number of clusters in advance . It sounds like a dream ！ therefore , Let's look at what hierarchical clustering is and how it improves K-means Of .

** What is hierarchical clustering ？**

We have the following points , We want to cluster them ：

We can take each point as a separate cluster ：

Now? , Based on the similarity of these clusters , We can put the most similar clusters together , And repeat this process until you have a single cluster left ：

We need to build a hierarchy of clusters , That's why this algorithm is called hierarchical clustering . We'll discuss how to determine the number of clusters in the next section . Now? , Let's look at different types of hierarchical clustering .

** The types of hierarchical clustering **

There are two main hierarchical clustering ：

1. Aggregate hierarchical clustering

2. Split hierarchical clustering

Let's take a look at each one in detail ：

** Aggregate hierarchical clustering **

Put each point in a separate cluster . Suppose there are four data points here . We divide each point into a cluster , At the beginning, there will be four clusters ：

then , In each iteration , We fuse the most similar pairs of points , Then repeat the above steps until there is only a single cluster ：

Every step of the way we're merging （ Or add ） cluster , Right ？ therefore , This kind of clustering is also called cumulative hierarchical clustering .

** Split hierarchical clustering **

Split hierarchical clustering is the opposite idea . And divide it from the beginning n A cluster of （n An observation ） Different , We started with just one cluster , And put all the points in this cluster .

therefore , We have 10 A or 1000 It doesn't matter . All the points are in the same cluster at the beginning ：

Now? , In each iteration , We separate the farthest points in the cluster , And repeat the process until each cluster has only one point ：

We're splitting at every step of the way （ Or divide ） cluster , So it's called split hierarchical clustering .

Aggregation clustering is widely used in industry , In this paper, we will also focus on . Once we've got the aggregation , Split hierarchical clustering will also become very simple .

** The steps of hierarchical clustering **

We fuse the most similar points or classes in hierarchical clustering —— We already know that . Now the question is —— How to decide which points are similar and which are not ？ This is one of the most important problems in clustering ！

Here's a way to calculate similarity —— Calculate the distance between cluster centers . The nearest point is considered a similar point , We can fuse them . We can call this a distance based algorithm （ Because we calculated the distance between clusters ）.

In hierarchical clustering, there is a proximity matrix （proximity matrix） The concept of . This matrix stores the distance between each pair of points . Let's use an example to understand the matrix and hierarchical clustering method .

** Example **

Suppose a teacher wants to divide her students into different groups . She has the scores each student gets on an assignment , Based on these scores , She wants to divide the students into different groups . There is no fixed goal for grouping . Because the teacher doesn't know which kind of students should be assigned to which group , It can't be described as a supervised learning problem . therefore , We will use hierarchical clustering to divide students into different groups .

Our examples are 5 A student ：

** Create an adjacency matrix **

First , We create an adjacency matrix , This matrix will tell us the distance between these points . We calculated the distance between every two points , You'll get one n*n The square matrix of （n It's the number of observations ）.

Let's take a look at the adjacency matrix between these five points ：

The diagonal of this matrix is always 0, Because the distance between each point and itself is always 0. We will use the Euclidean distance formula to calculate the remaining distance . therefore , Let's look at the points we want to calculate 1 and 2 Distance between ：

√(10-7)^2 = √9 = 3

Similarly , We can calculate the distance between all the points , And fill in the adjacency matrix .

** step **

First step ： First , We put all the points in a cluster ：

Different colors represent different clusters . You can see in the data 5 Points make up five different clusters .

The second step ： Next , Find the shortest distance point in adjacency matrix , And put these dots together . Then update the adjacency matrix .

ad locum , The minimum distance is 3, So put some 1 and 2 To merge ：

Let's look at the updated clusters and update the adjacency matrix accordingly ：

ad locum , We took two points （7,10） To replace the tag of the cluster . We can also use the minimum or average instead of . Now? , We will calculate the adjacency matrix of these clusters again ：

The third step ： Repeat step 2 Until there is only 1 A cluster of .

Let's look at the smallest value in the adjacency matrix , Then merge the closest pair in the cluster . After repeating the above steps, we will get the following fused clusters ：

We started to have 5 A cluster of , In the end, there's only a single cluster . This is how aggregate hierarchical clustering works . But there are still thorny problems —— How to determine the number of clusters ？ Let's look at the next part .

** In hierarchical clustering , How should we choose the number of clusters ？**

Are you ready to answer the question that has been asked since the beginning ？ In order to obtain the number of hierarchical clusters , We used a wonderful concept called a tree .

** The tree is a tree chart , Be able to record the order of fusion or division .**

Let's go back to the teacher - Examples of students . Whenever you merge two classes , A tree will record the distance between these classes and represent it as a graph . Let's take a look at what the tree looks like ：

We put the sample in x Axis , Distance as y Axis . Whenever two clusters merge , We're all going to join the tree , The height between the connecting points is the distance between these points . Let's create a tree of examples ：

It takes some time to process the above picture . We started to fuse the samples 1 and 2, The distance between these two points is 3（ It refers to the first adjacency matrix that appears in the previous section ）. Let's put it on the tree ：

ad locum , We can look at the fusion sample 1 and 2. The vertical line represents the distance between two points . alike , Let's draw all the steps of merging clusters onto the graph , Finally, we can get the following tree ：

We can clearly visualize the steps of hierarchical clustering . The longer the vertical line is , The further apart the clusters are .

Now? , We can set a distance threshold , And draw a horizontal line （ General , We're going to set thresholds in this way , It cuts off the most common vertical line ）. Let's set the threshold to 12, Then draw a horizontal line .

The number of classes is the number of vertical lines that intersect the threshold first . In the above example , Because the red line crosses two vertical lines , We will have 2 Classes . A class includes samples （1,2,4）, Another class includes samples （3,5）. Very clear, right ?

This is how we use the tree to determine the number of classes in hierarchical clustering . In the next section , We will apply hierarchical clustering to help you understand the concepts learned in this article .

** Using hierarchical clustering to solve the wholesale customer classification problem **

It's time to start Python 了 ！

We're going to start solving a wholesale customer classification problem . You can download datasets here （https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv）.

This data is hosted in UCI Machine learning knowledge base . The goal of this question is to a wholesaler's customers based on their different product types （ Like milk 、 The groceries 、 Area, etc ） The annual expenses are classified .

Let's explore the data first , Then we use hierarchical clustering to classify customers .

First import the required Library ：

view rawimporting_libraries.py hosted with by GitHub

https://mp.weixin.qq.com/cgi-bin/appmsg?t=media/appmsg_edit&action=edit&type=10&appmsgid=100034460&isMul=1&isSend=0&token=886063492&lang=zh_CN#file-importing_libraries-py

Load the dataset and look at the first few lines ：

view raw

https://gist.github.com/PulkitS01/8ac9bf3b54eb59b4e1d4eaa21d3d774e/raw/6cea281dc4dea42bbcb2160e6cef1535cad765e7/reading_data.py

There are many kinds of products here —— fresh 、 milk 、 Groceries and so on . The number represents the quantity purchased by each customer . Our goal is to divide classes from this data , Similar customers can be grouped into the same category . We will use hierarchical clustering to solve this problem .

But before the practical application of hierarchical clustering solution , We need to normalize the data set so that all variables have the same scale . Why is this step important ？ Because if the variable scales are different , The model tends to favor variables with a higher order of magnitude, such as fresh or milk （ The table above ）.

therefore , First normalize the data , Put all the variables on the same scale .

ad locum , You can see that the scales of all variables are almost the same . Now? , We can start hierarchical clustering . First draw a tree to help us determine the number of clusters in this problem ：

X The axis is the sample ,y The axis represents the distance between samples . The vertical line with the greatest distance is the blue line , So we can decide that the threshold is 6, And cut off the tree ：

This line has two intersections , So we have two clusters . Let's use hierarchical clustering ：

In our definition of 2 After a cluster , We can see in the output 0 and 1 Value .0 Represents the value that belongs to the first cluster , and 1 Represents the value belonging to the second cluster . Now visualize the two clusters ：

fantastic ！ We can now see two clusters clearly . This is what we use Python To achieve hierarchical clustering process .

** Write it at the end **

Hierarchical clustering is a very useful way to divide observations . The advantage is that there is no need to predefine the number of clusters , This makes it more than k-Means More advantage .

If you're new to data science , It is highly recommended that you take a practical machine learning course (https://courses.analyticsvidhya.com/courses/applied-machine-learning-beginner-to-professional?utm_source=blog&utm_medium=beginners-guide-hierarchical-clustering). This is one of the most comprehensive end-to-end machine learning courses you can find anywhere . Hierarchical clustering is just one of the many topics covered in the course .

Original title ：

A Beginner’s Guide to Hierarchical Clustering and how to Perform it in Python

Link to the original text ：

https://www.analyticsvidhya.com/blog/2019/05/beginners-guide-hierarchical-clustering/

** Introduction to translator ： Chen Chao **, Master of applied psychology, Peking University . I majored in computer science , After that, he made unremitting efforts on the road of psychology . It is increasingly found that data analysis and programming have become two required survival skills , So make every effort to better contact and understand the relevant knowledge in daily life , But the road is long , I'm still on my way .

** from ： Data pie THU official account ;**

**END**

** Copyright notice ： Part of the content of this issue comes from the Internet , Please indicate the original link and author , If there is any infringement or source error, please contact us .**

** Cooperation please add QQ：365242293 **

** Data analysis **（ID : ecshujufenxi ） Internet technology and wechat of data circle itself , It's also WeMedia A member of the We Media Alliance ,WeMedia Alliance coverage 5000 10000 people .