Renesh Bedre 8 minute read

## What is t-Distributed Stochastic Neighbor Embedding (t-SNE)?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a __non-parametric dimensionality reduction technique__ in whichhigh-dimensional data (n features) is mapped into low-dimensional data (typically 2 or 3 features) while preservingrelationship among the data points of original high-dimensional data.

A t-SNE algorithm is an __unsupervised__ machine learning algorithm primarily used for visualizing. Using[scatter plots]((scatter-plot-matplotlib.html), low-dimensional data generated with t-SNE can be visualized easily.

t-SNE is a __probabilistic model__, and it models the probability of neighboring points such that similar sampleswill be placed together and dissimilar samples at greater distances. Hence, t-SNE is helpful in understanding thedata structure and distribution characteristics of the high-dimensional datasets.

Unlike principal component analysis (PCA) which needs linear data, t-SNE work betterwith both __linear and non-linear__ well-clustered datasets and produces more meaningful clustering.

## Why to use t-SNE?

The majority of big data datasets contain __hundreds or thousands of variables__. Due to a large number of variables, itis impractical to visualize the data (even with pairwise scatter plots), so we must use dimensional reductiontechniques to understand their structure and relationships.

As an example, __single-cell RNA-seq (scRNA-seq)__ produces the expression data for thousands of genes and millions of cellsin bioinformatics analysis. To understand biologically meaningful cluster structures, such high-dimensional datasetsmust be analyzed and visualized.

Interpreting such high-dimensional non-linear data would be impractical without transforming them into low-dimensional data. Usingdimension reduction techniques such as t-SNE, high-dimensional datasets can be reduced into two-dimensional space forvisualization and understanding biologically meaningful clusters present in high-dimensional datasets.

## How to perform t-SNE in Python

In Python, t-SNE analysis and visualization can be performed using the `TSNE()`

function from `scikit-learn`

and `bioinfokit`

packages.

Here, I will use the scRNA-seq dataset for visualizing the hiddenbiological clusters. I have downloaded the subset of scRNA-seq dataset of *Arabidopsis thaliana* root cells processedby 10x genomics Cell Ranger pipeline

This scRNA-seq dataset contains 4406 cells with ~75K sequence reads per cells. This dataset is pre-processed usingSeurat R package and only used 2000 highly variable genes(variables or features) for t-SNE visualization.

Now, import the pre-processed scRNA-seq data using `get_data()`

function from `bioinfokit`

package. If you have yourown dataset, you should import it as a pandas dataframe.

`# import scRNA-seq as pandas dataframefrom bioinfokit.analys import get_datadf = get_data('ath_root').datadf = df.set_index(df.columns[0])dft = df.Tdft = dft.set_index(dft.columns[0])dft.head(2)# outputgene AT1G01070 RPP1A HTR12 AT1G01453 ADF10 PLIM2B SBTI1.1 GL22 GPAT2 AT1G02570 BXL2 IMPA6 ... PER72 RAB18 AT5G66440 AT5G66580 AT5G66590 AT5G66800 AT5G66815 AT5G66860 AT5G66985 IRX14H PER73 RPL26BAAACCTGAGACAGACC-1 0.51 1.40 -0.26 -0.28 -0.24 -0.14 -0.13 -0.07 -0.29 -0.31 -0.23 0.66 ... -0.25 0.64 0.61 -0.55 -0.41 -0.43 2.01 3.01 -0.24 -0.18 -0.34 1.16AAACCTGAGATCCGAG-1 -0.22 1.36 -0.26 -0.28 -0.60 -0.51 -0.13 -0.07 -0.29 -0.31 0.81 -0.31 ... -0.25 1.25 -0.48 -0.55 -0.41 -0.43 -0.24 0.89 -0.24 -0.18 -0.49 -0.68# check the dimension (rows, columns)dft.shape# output(4406, 2000)`

As there is a very large number of variables (2000), we will use another dimension reduction technique such as PCA toreduce the number of variables to a reasonable number (e.g. 20 to 50) for t-SNE.

Note: The PCA is a recommended method to reduce the number of input features(when there are a large number of features) to a reasonable number (e.g. 20 to 50) to speed up the t-SNE computationtime and suppresses the noisy data points.

`from sklearn.decomposition import PCAimport pandas as pd# perform PCApca_scores = PCA().fit_transform(dft)# create a dataframe of pca_scoresdf_pc = pd.DataFrame(pca_scores)`

Now, perform the t-SNE on the first 50 features obtained from the PCA. By default, `TSNE()`

function uses the __Barnes-Hutapproximation__, which is computationally less intensive.

`# perform t-SNE on PCs scores# we will use first 50 PCs but this can varyfrom sklearn.manifold import TSNEtsne_em = TSNE(n_components = 2, perplexity = 30.0, early_exaggeration = 12, n_iter = 1000, learning_rate = 368, verbose = 1).fit_transform(df_pc.loc[:,0:49])# output[t-SNE] Computing 91 nearest neighbors...[t-SNE] Indexed 4406 samples in 0.081s...[t-SNE] Computed neighbors for 4406 samples in 1.451s...[t-SNE] Computed conditional probabilities for sample 1000 / 4406[t-SNE] Computed conditional probabilities for sample 2000 / 4406[t-SNE] Computed conditional probabilities for sample 3000 / 4406[t-SNE] Computed conditional probabilities for sample 4000 / 4406[t-SNE] Computed conditional probabilities for sample 4406 / 4406[t-SNE] Mean sigma: 4.812347[t-SNE] KL divergence after 250 iterations with early exaggeration: 64.164688[t-SNE] KL divergence after 1000 iterations: 0.840337# here you can run TSNE multiple times to keep run with lowest KL divergence`

Note: t-SNE is a stochastic method and produces slightly different embeddings if run multiple times. t-SNE can be runseveral times to get the embeddings with the smallest Kullback–Leibler (KL) divergence. The run with the smallest KL couldhave the greatest variation.

You have run the t-SNE to obtain a run with smallest KL divergenece.

In t-SNE, several parameters needs to be optimized (hyperparameter tuning) for building the effective model.

*perplexity* is the most important parameter in t-SNE, and it measures the effective number of neighbors. The number of variables in theoriginal high-dimensional data determines the perplexity parameter (standard range 10-100). In case of large, datasets,keeping large perplexity parameter (n/100; where n is the number of observations) is helpful for preserving the global geometry.

In addition to the perplexity parameter, other parameters such as the number of iterations (*n_iter*), learning rate(set n/12 or 200 whichever is greater), and early exaggeration factor (*early_exaggeration*) can also affect thevisualization and should be optimized for larger datasets (Kobak et al., 2019).

Now, visualize the t-SNE clusters,

`# plot t-SNE clustersfrom bioinfokit.visuz import clustercluster.tsneplot(score=tsne_em)# plot will be saved in same directory (tsne_2d.png) `

Generated t-SNE plot,

As t-SNE is an unsupervised learning method, we do not have sample target information. Hence, I will recognize theclusters using the DBSCAN clustering algorithm. This will help to color and visualize clusters ofsimilar data points

`from sklearn.cluster import DBSCAN# here eps parameter is very important and optimizing eps is essential# for well defined clusters. I have run DBSCAN with several eps values# and got good clusters with eps=3get_clusters = DBSCAN(eps = 3, min_samples = 10).fit_predict(tsne_em)# check unique clusters# -1 value represents noisy points could not assigned to any clusterset(get_clusters)# output{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -1}# get t-SNE plot with colors assigned to each clustercluster.tsneplot(score=tsne_em, colorlist=get_clusters, colordot=('#713e5a', '#63a375', '#edc79b', '#d57a66', '#ca6680', '#395B50', '#92AFD7', '#b0413e', '#4381c1', '#736ced', '#631a86', '#de541e', '#022b3a', '#000000'), legendpos='upper right', legendanchor=(1.15, 1))`

Generated t-SNE plot,

In t-SNE scatter plot, the points within the individual clusters are highly similar to each other and in distant topoints in other clusters. The same pattern likely holds in a high-dimensional original dataset. In the context ofscRNA-seq, these clusters represent the cells types with similar transcriptional profiles.

## Advantages of t-SNE

**No assumption of linearity**:t-SNE does not assume any relationship in the input features and it can be applied toboth linear and non-linear datasets**Preserves local structure**: t-SNE preserves the structure of high-dimensional datasets i.e. the close points remain closer and distant pointsremain distant in low-dimensional space.**Non-parametric method**: t-SNE is a non-parametric machine learning method

## Disadvantages of t-SNE

**t-SNE is slow**: t-SNE is a computationally intensive technique and takes longer time on larger datasets. Hence, it is recommended to usethe PCA method prior to t-SNE if the original datasets contain a very large number of input features. You should considerusing UMAP dimension reduction method) for faster run time performance on larger datasets.**t-SNE is a stochastic method**: t-SNE is a stochastic method and produces slightly different embeddings if run multiple times. These different resultscould affect the numeric values on the axis but do not affect the clustering of the points. Therefore, t-SNE can berun several times to get the embeddings with the smallest Kullback–Leibler (KL) divergence.**t-SNE does not preserve global geometry**: While t-SNE is good at visualizing the well-separated clusters, most of the time it fails to preserve theglobal geometry of the data. Hence, t-SNE is not recommended for classification purposes.**Hyperparameter optimization**: t-SNE has various parameters to optimize to obtain well fitted model

## Enhance your skills with courses on genomics and bioinformatics

- Genomic Data Science Specialization
- Biology Meets Programming: Bioinformatics for Beginners
- Python for Genomic Data Science
- Bioinformatics Specialization
- Command Line Tools for Genomic Data Science
- Introduction to Genomic Technologies

## Enhance your skills with courses on machine learning

- Advanced Learning Algorithms
- Machine Learning Specialization
- Machine Learning with Python
- Machine Learning for Data Analysis
- Supervised Machine Learning: Regression and Classification
- Unsupervised Learning, Recommenders, Reinforcement Learning
- Deep Learning Specialization
- AI For Everyone
- AI in Healthcare Specialization
- Cluster Analysis in Data Mining

## References

- Maaten LV, Hinton G. Visualizing data using t-SNE. Journal of machine learning research. 2008;9(Nov):2579-605.
- Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature communications. 2019 Nov 28;10(1):1-4.
- Cieslak MC, Castelfranco AM, Roncalli V, Lenz PH, Hartline DK. t-Distributed Stochastic Neighbor Embedding (t-SNE): A toolfor eco-physiological transcriptomic analysis. Marine Genomics. 2019 Nov 26:100723.
- Rich-Griffin C, Stechemesser A, Finch J, Lucas E, Ott S, Schäfer P. Single-cell transcriptomics: a high-resolutionavenue for plant functional genomics. Trends in plant science. 2020 Feb 1;25(2):186-97.
- Devassy BM, George S. Dimensionality reduction and visualisation of hyperspectral ink data Using t-SNE. ForensicScience International. 2020 Feb 12:110194.
- Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y. Fast interpolation-based t-SNE for improved visualizationof single-cell RNA-seq data. Nature methods. 2019 Mar;16(3):243-5.
- Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across differentconditions, technologies, and species. Nature biotechnology. 2018 May;36(5):411-20.
- Ryu KH, Huang L, Kang HM, Schiefelbein J. Single-cell RNA sequencing resolves molecular relationships among individualplant cells. Plant physiology. 2019 Apr 1;179(4):1444-56.
- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their Applications to Handwritten Digit Recognition,MSc Thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University.

This work is licensed under a Creative Commons Attribution 4.0 International License

Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase.The retailer will pay the commission at no additional cost to you.