t-SNE in Python for visualization of high-dimensional data (2024)

Renesh Bedre 8 minute read


What is t-Distributed Stochastic Neighbor Embedding (t-SNE)?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-parametric dimensionality reduction technique in which high-dimensional data (n features) is mapped into low-dimensional data (typically 2 or 3 features) while preserving the relationships among the data points of the original high-dimensional data.

t-SNE is an unsupervised machine learning algorithm primarily used for visualization. Low-dimensional data generated with t-SNE can be visualized easily using [scatter plots](scatter-plot-matplotlib.html).

t-SNE is a probabilistic model: it models the probability of neighboring points such that similar samples are placed together and dissimilar samples at greater distances. Hence, t-SNE is helpful for understanding the structure and distribution characteristics of high-dimensional datasets.

Unlike principal component analysis (PCA), which assumes linear relationships in the data, t-SNE works well with both linear and non-linear well-clustered datasets and produces more meaningful clustering.
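The contrast between the two methods can be sketched on a toy non-linear dataset. This is an illustrative example, not the article's workflow: two concentric circles are not linearly separable, so a linear projection such as PCA cannot untangle them, while t-SNE, which models neighborhood probabilities, can pull the two rings apart.

```python
# Minimal sketch: PCA vs t-SNE on a non-linear toy dataset (two concentric circles).
# The dataset and parameter choices here are illustrative assumptions.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)

# PCA is a linear projection: the two rings remain intermixed
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE models pairwise neighbor probabilities and can separate the rings
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

print(X_pca.shape, X_tsne.shape)
```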

Why use t-SNE?

The majority of big datasets contain hundreds or thousands of variables. Due to the large number of variables, it is impractical to visualize the data directly (even with pairwise scatter plots), so we must use dimensionality reduction techniques to understand their structure and relationships.

For example, in bioinformatics analysis, single-cell RNA-seq (scRNA-seq) produces expression data for thousands of genes across millions of cells. To understand biologically meaningful cluster structures, such high-dimensional datasets must be analyzed and visualized.

Interpreting such high-dimensional non-linear data would be impractical without transforming it into a low-dimensional representation. Using dimensionality reduction techniques such as t-SNE, high-dimensional datasets can be reduced to two dimensions for visualization and for understanding the biologically meaningful clusters they contain.

How to perform t-SNE in Python

In Python, t-SNE analysis and visualization can be performed using the TSNE() function from the scikit-learn package together with the bioinfokit package.

Here, I will use an scRNA-seq dataset for visualizing the hidden biological clusters. I have downloaded a subset of an scRNA-seq dataset of Arabidopsis thaliana root cells processed by the 10x Genomics Cell Ranger pipeline.

This scRNA-seq dataset contains 4406 cells with ~75K sequence reads per cell. The dataset was pre-processed using the Seurat R package, and only the 2000 most highly variable genes (variables or features) are used for t-SNE visualization.

Now, import the pre-processed scRNA-seq data using the get_data() function from the bioinfokit package. If you have your own dataset, you should import it as a pandas dataframe.
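For your own data, a minimal import sketch might look like the following. The inline CSV here is a hypothetical stand-in; the assumed layout (cells in rows, genes in columns, first column as the cell ID) mirrors the dataframe used below.

```python
# Sketch: importing your own expression matrix as a pandas dataframe.
# The tiny inline CSV is a hypothetical placeholder for a real file.
import io
import pandas as pd

csv_text = """cell,GENE1,GENE2,GENE3
cell_1,0.5,1.2,-0.3
cell_2,-0.2,0.8,0.7
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
# for a real file on disk, the equivalent would be:
# df = pd.read_csv('your_counts.csv', index_col=0)
print(df.shape)
```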

```python
# import scRNA-seq data as a pandas dataframe
from bioinfokit.analys import get_data
df = get_data('ath_root').data
df = df.set_index(df.columns[0])
dft = df.T
dft = dft.set_index(dft.columns[0])
dft.head(2)
# output
# gene               AT1G01070  RPP1A  HTR12  AT1G01453  ADF10  PLIM2B  SBTI1.1  GL22  GPAT2  AT1G02570  BXL2  IMPA6  ...  PER72  RAB18  AT5G66440  AT5G66580  AT5G66590  AT5G66800  AT5G66815  AT5G66860  AT5G66985  IRX14H  PER73  RPL26B
# AAACCTGAGACAGACC-1      0.51   1.40  -0.26      -0.28  -0.24   -0.14    -0.13 -0.07  -0.29      -0.31 -0.23   0.66  ...  -0.25   0.64       0.61      -0.55      -0.41      -0.43       2.01       3.01      -0.24   -0.18  -0.34    1.16
# AAACCTGAGATCCGAG-1     -0.22   1.36  -0.26      -0.28  -0.60   -0.51    -0.13 -0.07  -0.29      -0.31  0.81  -0.31  ...  -0.25   1.25      -0.48      -0.55      -0.41      -0.43      -0.24       0.89      -0.24   -0.18  -0.49   -0.68

# check the dimension (rows, columns)
dft.shape
# output
# (4406, 2000)
```

As there is a very large number of variables (2000), we will first use another dimensionality reduction technique, PCA, to reduce the number of variables to a reasonable number (e.g. 20 to 50) for t-SNE.

Note: PCA is a recommended method to reduce the number of input features (when there is a large number of features) to a reasonable number (e.g. 20 to 50); this speeds up the t-SNE computation and suppresses noise.

```python
from sklearn.decomposition import PCA
import pandas as pd

# perform PCA
pca_scores = PCA().fit_transform(dft)
# create a dataframe of pca_scores
df_pc = pd.DataFrame(pca_scores)
```
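To sanity-check how many PCs to keep, you can inspect the cumulative explained variance of the leading components. In this sketch a synthetic random matrix stands in for the scRNA-seq dataframe, so the numbers are illustrative only:

```python
# Sketch: choose the number of PCs for t-SNE by inspecting explained variance.
# The random matrix below is a stand-in for dft (cells x genes).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))

pca = PCA(n_components=50).fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)  # cumulative variance per PC
print(f"variance explained by 50 PCs: {cumvar[-1]:.2f}")
```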

Now, perform t-SNE on the first 50 principal components obtained from the PCA. By default, the TSNE() function uses the Barnes-Hut approximation, which is computationally less intensive.

```python
# perform t-SNE on PCA scores
# we will use the first 50 PCs, but this can vary
from sklearn.manifold import TSNE
tsne_em = TSNE(n_components=2, perplexity=30.0, early_exaggeration=12, n_iter=1000,
               learning_rate=368, verbose=1).fit_transform(df_pc.loc[:, 0:49])
# output
# [t-SNE] Computing 91 nearest neighbors...
# [t-SNE] Indexed 4406 samples in 0.081s...
# [t-SNE] Computed neighbors for 4406 samples in 1.451s...
# [t-SNE] Computed conditional probabilities for sample 1000 / 4406
# [t-SNE] Computed conditional probabilities for sample 2000 / 4406
# [t-SNE] Computed conditional probabilities for sample 3000 / 4406
# [t-SNE] Computed conditional probabilities for sample 4000 / 4406
# [t-SNE] Computed conditional probabilities for sample 4406 / 4406
# [t-SNE] Mean sigma: 4.812347
# [t-SNE] KL divergence after 250 iterations with early exaggeration: 64.164688
# [t-SNE] KL divergence after 1000 iterations: 0.840337

# you can run TSNE multiple times and keep the run with the lowest KL divergence
```

Note: t-SNE is a stochastic method and produces slightly different embeddings each time it is run. You can run t-SNE several times and keep the embedding with the smallest Kullback–Leibler (KL) divergence, as that run best preserves the neighborhood structure of the original data.

You have now run t-SNE; ideally, repeat the run and keep the embedding with the smallest KL divergence.
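The repeat-and-keep-best strategy can be sketched as follows. The synthetic matrix stands in for the PCA scores, and `kl_divergence_` is the final KL value that scikit-learn's TSNE stores after fitting:

```python
# Sketch: run t-SNE several times and keep the embedding with the lowest
# KL divergence (random data stands in for the 50 PCA scores).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))

best_em, best_kl = None, np.inf
for seed in range(3):
    tsne = TSNE(n_components=2, perplexity=30, random_state=seed)
    em = tsne.fit_transform(X)
    if tsne.kl_divergence_ < best_kl:          # final KL after optimization
        best_kl, best_em = tsne.kl_divergence_, em

print(f"best KL divergence: {best_kl:.3f}")
```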

In t-SNE, several parameters need to be optimized (hyperparameter tuning) to build an effective model.

perplexity is the most important parameter in t-SNE; it measures the effective number of neighbors. The size of the original high-dimensional dataset determines the appropriate perplexity (the standard range is 10-100). For large datasets, a large perplexity (n/100, where n is the number of observations) helps preserve the global geometry.

In addition to perplexity, other parameters such as the number of iterations (n_iter), the learning rate (set to n/12 or 200, whichever is greater), and the early exaggeration factor (early_exaggeration) can also affect the visualization and should be optimized for larger datasets (Kobak et al., 2019).
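A simple way to gauge sensitivity to perplexity is a small sweep, comparing embeddings across a few values. In this sketch a random matrix stands in for the PCA scores; in the workflow above the input would be the first 50 PCs:

```python
# Sketch: sweep a few perplexity values to see how stable the embedding is.
# Random data stands in for the PCA scores used in the article's workflow.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))

embeddings = {}
for perp in (5, 30, 50):
    embeddings[perp] = TSNE(n_components=2, perplexity=perp,
                            random_state=0).fit_transform(X)

print({p: e.shape for p, e in embeddings.items()})
```

If the cluster structure changes drastically between reasonable perplexity values, the apparent clusters should be interpreted with caution.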

Now, visualize the t-SNE clusters:

```python
# plot t-SNE clusters
from bioinfokit.visuz import cluster
cluster.tsneplot(score=tsne_em)
# plot will be saved in the same directory (tsne_2d.png)
```

Generated t-SNE plot,

[Figure: t-SNE plot of the scRNA-seq dataset]

As t-SNE is an unsupervised learning method, we do not have sample target information. Hence, I will identify the clusters using the DBSCAN clustering algorithm. This will help to color and visualize clusters of similar data points.

```python
from sklearn.cluster import DBSCAN

# the eps parameter is very important, and optimizing eps is essential
# for well-defined clusters. I have run DBSCAN with several eps values
# and got good clusters with eps=3
get_clusters = DBSCAN(eps=3, min_samples=10).fit_predict(tsne_em)

# check unique clusters
# the -1 value represents noisy points that could not be assigned to any cluster
set(get_clusters)
# output
# {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -1}

# get t-SNE plot with colors assigned to each cluster
cluster.tsneplot(score=tsne_em, colorlist=get_clusters,
                 colordot=('#713e5a', '#63a375', '#edc79b', '#d57a66', '#ca6680',
                           '#395B50', '#92AFD7', '#b0413e', '#4381c1', '#736ced',
                           '#631a86', '#de541e', '#022b3a', '#000000'),
                 legendpos='upper right', legendanchor=(1.15, 1))
```

Generated t-SNE plot,

[Figure: t-SNE plot colored by DBSCAN clusters]

In the t-SNE scatter plot, points within an individual cluster are highly similar to each other and distant from points in other clusters. The same pattern likely holds in the high-dimensional original dataset. In the context of scRNA-seq, these clusters represent cell types with similar transcriptional profiles.
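If you prefer plain matplotlib over bioinfokit's tsneplot, the same coloring-by-DBSCAN-labels step can be sketched as follows. Here a synthetic two-blob embedding stands in for tsne_em, and the eps value is chosen for this toy data, not the article's dataset:

```python
# Sketch: color a t-SNE embedding by DBSCAN labels using plain matplotlib.
# Two synthetic blobs stand in for the real tsne_em embedding.
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, render to file only
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
em = np.vstack([rng.normal(0, 0.5, (100, 2)),   # blob 1
                rng.normal(5, 0.5, (100, 2))])  # blob 2

# eps tuned for this toy embedding; -1 labels mark noise points
labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(em)

plt.scatter(em[:, 0], em[:, 1], c=labels, s=8, cmap='tab20')
plt.savefig('tsne_dbscan.png', dpi=150)
print(sorted(set(labels)))
```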

Advantages of t-SNE

  • No assumption of linearity: t-SNE does not assume any particular relationship among the input features and can be applied to both linear and non-linear datasets.
  • Preserves local structure: t-SNE preserves the local structure of high-dimensional datasets, i.e. close points remain close and distant points remain distant in the low-dimensional space.
  • Non-parametric method: t-SNE is a non-parametric machine learning method.

Disadvantages of t-SNE

  • t-SNE is slow: t-SNE is computationally intensive and takes a long time on larger datasets. Hence, it is recommended to apply PCA prior to t-SNE if the original dataset contains a very large number of input features. You should also consider the UMAP dimensionality reduction method for faster run times on larger datasets.
  • t-SNE is a stochastic method: t-SNE produces slightly different embeddings each time it is run. These different results can change the numeric values on the axes but do not affect the clustering of the points. Therefore, t-SNE can be run several times to get the embedding with the smallest Kullback–Leibler (KL) divergence.
  • t-SNE does not preserve global geometry: while t-SNE is good at visualizing well-separated clusters, most of the time it fails to preserve the global geometry of the data. Hence, t-SNE is not recommended for classification purposes.
  • Hyperparameter optimization: t-SNE has various parameters that must be optimized to obtain a well-fitted model.


References

  • van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9(Nov):2579-605.
  • Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature Communications. 2019 Nov 28;10(1):1-4.
  • Cieslak MC, Castelfranco AM, Roncalli V, Lenz PH, Hartline DK. t-Distributed Stochastic Neighbor Embedding (t-SNE): A tool for eco-physiological transcriptomic analysis. Marine Genomics. 2019 Nov 26:100723.
  • Rich-Griffin C, Stechemesser A, Finch J, Lucas E, Ott S, Schäfer P. Single-cell transcriptomics: a high-resolution avenue for plant functional genomics. Trends in Plant Science. 2020 Feb 1;25(2):186-97.
  • Devassy BM, George S. Dimensionality reduction and visualisation of hyperspectral ink data using t-SNE. Forensic Science International. 2020 Feb 12:110194.
  • Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods. 2019 Mar;16(3):243-5.
  • Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology. 2018 May;36(5):411-20.
  • Ryu KH, Huang L, Kang HM, Schiefelbein J. Single-cell RNA sequencing resolves molecular relationships among individual plant cells. Plant Physiology. 2019 Apr 1;179(4):1444-56.

This work is licensed under a Creative Commons Attribution 4.0 International License

