Beginner’s Guide to Applying t-SNE to Your High-Dimensional Dataset
Reduce your data’s dimensionality and gain insights for better predictions.
In this article, you’ll learn how to apply t-Distributed Stochastic Neighbor Embedding, or t-SNE. While this may sound scary, it’s just a powerful technique to visualize high-dimensional data using feature extraction. t-SNE maximizes the distance in two-dimensional space between observations that are most different in the high-dimensional space. Because of this, observations that are similar will be close to one another and may become clustered.
Here is one example of t-SNE applied to the Iris dataset:
We can see that the setosa species forms a separate cluster, while the other two are closer together and therefore more similar. However, the Iris dataset only has 4 dimensions to start with, so let’s try this on a more challenging dataset.
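A plot like the one above can be produced with a few lines of scikit-learn and matplotlib. This is a minimal sketch, not the exact code behind the figure; the `random_state` value is an arbitrary assumption for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

iris = load_iris()

# Project the 4 Iris features down to 2 t-SNE dimensions.
tsne_features = TSNE(learning_rate=50, random_state=0).fit_transform(iris.data)

# Color each point by its species label to reveal the clusters.
plt.scatter(tsne_features[:, 0], tsne_features[:, 1], c=iris.target, alpha=0.6)
plt.show()
```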
We will use the ANSUR male body measurement dataset, which has 99 dimensions. Before we apply t-SNE we’re going to remove all the non-numeric columns from the dataset by passing a list with the unwanted column names to the Pandas drop method. TSNE() doesn’t work with non-numeric data. We could use a trick like one-hot encoding to get around this, but we’ll be using a different approach here.
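The drop step looks like the sketch below. Since the real ANSUR dataframe isn’t reproduced here, a tiny stand-in dataframe is used; the column names other than `BMI_class` and `Height_class` (which appear later in the article) are assumptions.

```python
import pandas as pd

# Stand-in for the ANSUR dataframe; the real one has 99 dimensions.
df = pd.DataFrame({
    'Gender': ['Male', 'Male'],            # assumed non-numeric column
    'BMI_class': ['Normal', 'Overweight'],
    'Height_class': ['Tall', 'Short'],
    'weight_kg': [80.5, 95.0],             # assumed numeric column
    'stature_mm': [1850, 1700],            # assumed numeric column
})

# Pass the unwanted column names to drop() to keep only numeric data.
non_numeric = ['Gender', 'BMI_class', 'Height_class']
df_numeric = df.drop(non_numeric, axis=1)

print(df_numeric.columns.tolist())  # → ['weight_kg', 'stature_mm']
```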
We’ll create a TSNE() model with a learning rate of 50. While fitting to the dataset, t-SNE evaluates candidate embeddings with an internal cost function. A high learning rate makes the optimization more adventurous in the embeddings it tries out, while a low learning rate makes it more conservative. Usually, learning rates fall in the 10 to 1000 range.
Next, we will fit the TSNE model to our numeric dataset and transform it. This will project our high-dimensional dataset onto a NumPy array with two dimensions.
We’ll assign these two dimensions back to our original dataset, naming them ‘x’ and ‘y’. We can now start plotting this data using seaborn’s scatterplot() method on the x and y columns we just added.
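Assigning the projection back and plotting it might look like this sketch, again using stand-in data in place of the ANSUR dataframe.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in for the ANSUR dataframe with only numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)),
                  columns=[f'feature_{i}' for i in range(5)])

tsne_features = TSNE(learning_rate=50).fit_transform(df.to_numpy())

# Attach the two t-SNE dimensions back to the dataframe as 'x' and 'y'.
df['x'] = tsne_features[:, 0]
df['y'] = tsne_features[:, 1]

sns.scatterplot(x='x', y='y', data=df)
plt.show()
```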
The resulting plot shows one big cluster, and in a sense, this could have been expected. There are no distinct groups of male body shapes with little in between, instead there is a more continuous distribution of body shapes, and thus, one big cluster.
However, using the categorical features we excluded from the analysis, we can check if there are interesting structural patterns within the cluster.
The Body Mass Index, or BMI, is a method to categorize people into weight groups regardless of their height. If we use this column for the hue (the color) of the seaborn scatterplot, we’ll be able to see that weight class indeed shows an interesting pattern.
sns.scatterplot(x='x', y='y', hue='BMI_class', data=df, alpha=0.6)
plt.show()
From the 90+ features in the dataset, t-SNE picked up that weight explains a lot of the variance and used that to spread out points along the x-axis, with underweight people on the left and overweight people on the right.
If we use the ‘Height_class’ to control the hue of the points we’ll be able to see that in the vertical direction, the variance is explained by a person’s height.
sns.scatterplot(x='x', y='y', hue='Height_class', data=df, alpha=0.6)
plt.show()
Tall people are at the top of the plot and shorter people are at the bottom.
In conclusion, t-SNE helped us visually explore our dataset and identify the most important drivers of variance in body shapes.