Customer Segmentation using KMeans
Created by Chia, Jonathan on Apr 09, 2022
Notes on cluster creation for behavioral customer segmentation
The Goal
========
Create distinguishable clusters of customers. Each cluster should exhibit behaviors clearly different from the other clusters.
Because there are so many different behaviors, segmenting customers manually would be very complex. Say we segment them by which products they purchase: we may end up with a lot of easy-to-name groups, but two groups built around different brands might still show very similar behaviors, because both brands appeal to the same kind of customer (for example, new customers).
We use machine learning because the algorithm can use math to find groups that are genuinely distinct and separate from each other in terms of behavior. K-means essentially plots all the customers as points in a multi-dimensional space and then finds these groups.
One big problem is that we have so many variables that we run into the curse of dimensionality: some variables are not very useful, some only add noise, and these extra dimensions make it much harder for the algorithm to work efficiently and accurately.
PCA
===
Principal Component Analysis helps address the problems we face with high-dimensional data. It can reduce noise and drop non-essential dimensions while still preserving most of the variance.
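As a minimal sketch, scikit-learn's PCA can keep just enough components to preserve a target share of the variance. The 0.90 threshold is illustrative, not a recommendation, and `X_scaled` is assumed to be the scaled customer-feature matrix (scaling is covered below):

```python
from sklearn.decomposition import PCA

# Keep enough components to preserve 90% of the variance
# (0.90 is an illustrative threshold, not a rule).
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)

print(pca.n_components_)                       # how many components were kept
print(pca.explained_variance_ratio_.cumsum())  # cumulative variance preserved
```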
Determining if PCA is needed
I don't remember the single best way to decide, but a practical approach is to run K-means both with and without PCA and compare the results. It all depends on your data and your problem. In our case, we tried it with PCA and liked the results we got.
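These notes don't prescribe a metric for that comparison; one hypothetical option is to cluster both versions of the data and compare silhouette scores. Note that the two scores live in different feature spaces, so treat this as a rough signal rather than a definitive test:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def cluster_and_score(X, k=4):
    """Cluster X with K-means and score how well separated the clusters are."""
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    return silhouette_score(X, labels)

X_reduced = PCA(n_components=0.90).fit_transform(X_scaled)
print("without PCA:", cluster_and_score(X_scaled))
print("with PCA:   ", cluster_and_score(X_reduced))
```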
Normalizing before running PCA
PCA decides on dimensionality reduction based on variance, so make sure everything is scaled to the same range (for example, 0 to 1) so that each variable is weighted equally.
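A minimal sketch with scikit-learn's MinMaxScaler, assuming `X` is the raw customer-feature matrix:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale every variable to the 0-1 range so no single variable
# dominates PCA's variance calculation just because of its units.
X_scaled = MinMaxScaler().fit_transform(X)
```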
Finding the right number of principal components to keep
Permutation Test:
Check that the variance explained by each component is not due to random chance.
Null hypothesis: simulate it by shuffling each column independently, so that every row (every point in the space) becomes random and any correlation between variables is destroyed. Running PCA on this shuffled data gives a null distribution of explained variance for random data.
Alternative hypothesis: if a component explains more variance than we see under the null distribution, we reject the null hypothesis and conclude that the component reflects real variance, not random variance.
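A sketch of this test; the function name and the 100-permutation default are mine, not from these notes, and `X` is assumed to be a scaled NumPy array with more customers (rows) than variables (columns):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_permutation_test(X, n_permutations=100, random_state=0):
    """Compare each component's observed explained variance to a null
    distribution built by shuffling every column independently."""
    rng = np.random.default_rng(random_state)
    observed = PCA().fit(X).explained_variance_ratio_

    null = np.empty((n_permutations, X.shape[1]))
    for i in range(n_permutations):
        # Shuffle each column on its own: this keeps every variable's
        # distribution but destroys the correlations between variables.
        X_shuffled = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        null[i] = PCA().fit(X_shuffled).explained_variance_ratio_

    # p-value per component: how often random data explains at least
    # as much variance as the real data does.
    p_values = (null >= observed).mean(axis=0)
    return observed, p_values
```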
K-Means
=======
See the below link for a visual explanation of the algorithm:
https://medium.com/dataseries/k-means-clustering-explained-visually-in-5-minutes-b900cc69d175
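For reference, fitting K-means itself is only a couple of lines in scikit-learn. The choice of 4 clusters here is purely illustrative; picking k is covered next:

```python
from sklearn.cluster import KMeans

# Fit K-means on the PCA-reduced data and label every customer.
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
cluster_labels = kmeans.fit_predict(X_reduced)  # X_reduced from the PCA step
```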
Finding the right K
Remember, the goal is to create distinguishable clusters. The more clusters we can make, the better (so a higher k is generally better), but we have to be careful: with a higher number of groups, the groups can become less distinguishable. For example, say we have 4 groups and groups 1 and 2 are already pretty close to each other. If we rerun with 5 groups, the 5th group may take some customers from group 1 and some from group 2. Now groups 1, 2, and 5 are all really close, with group 5 bridging 1 and 2, and these groups may not be as distinguishable anymore.
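These notes don't name a specific diagnostic for choosing k, but one common way to put a number on "distinguishable" is the silhouette score, hypothetically swept over a range of k values:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sweep k and score cluster separation; a higher silhouette means
# more compact, better separated (more distinguishable) clusters.
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X_reduced)
    print(k, round(silhouette_score(X_reduced, labels), 3))
```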
Testing if the algorithm made distinguishable enough clusters
If the clusters are distinguishable enough, a basic algorithm such as a decision tree should be able to take the clustered data, learn the differences in behavior, and then 'replicate' the clustering algorithm.
Let's walk through this 'replication' step by step; a code sketch putting the steps together follows below:
First we cluster the data and assign labels to all the customers
Then we split the data into a training and testing set
Finally, we run a decision tree on the training data
If the clusters are distinguishable enough, the decision tree should be able to learn the differences between them. To check how well it learned, we show it the test data (without the cluster labels) and let it assign a label to each test customer.
We then use a confusion matrix to compare the test data's actual cluster labels with the labels the decision tree assigned.
If accuracy is very high, then we know that each cluster was different enough that the decision tree didn't get confused when assigning cluster labels to customers.
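Putting the steps together as a sketch; the cluster count, split size, and the assumption that `X_reduced` is the prepared feature matrix are all illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# 1. Cluster the prepared data and assign a label to every customer.
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
labels = kmeans.fit_predict(X_reduced)

# 2. Split the customers and their cluster labels into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, labels, test_size=0.3, random_state=0
)

# 3. Train a decision tree on the training data to 'replicate' the clustering.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# 4. Ask the tree to label the held-out customers, then compare its labels
#    with the labels K-means actually gave them.
y_pred = tree.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```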