Look for the bare necessities

The simple bare necessities

Forget about your worries and your strife

I mean the bare necessities

Old Mother Nature’s recipes

That bring the bare necessities of life

– Baloo’s song [The Jungle Book]


Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.

## 2.2.1. Introduction

High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.

The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.


To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have been designed, such as Principal Component Analysis (PCA), Independent Component Analysis, Linear Discriminant Analysis, and others. These algorithms define specific rubrics to choose an “interesting” linear projection of the data. These methods can be powerful, but often miss important non-linear structure in the data.
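As a rough illustration of the difference between a random projection and a projection chosen by a rubric such as PCA's, the sketch below compares how much variance each keeps. The synthetic data, the helper `retained_variance`, and the use of a plain SVD in place of a library PCA class are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 500 points in 10-D whose
# variance is concentrated along two hidden directions, plus a little noise.
latent = rng.normal(size=(500, 2)) * [5.0, 3.0]
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 10))
X = X - X.mean(axis=0)

# PCA projection: the top-2 right singular vectors of the centered data.
_, _, vt = np.linalg.svd(X, full_matrices=False)
pca_basis = vt[:2].T  # shape (10, 2), orthonormal columns

# Random projection: an orthonormal basis of a random 2-D subspace.
random_basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))


def retained_variance(data, basis):
    """Fraction of the total variance kept after projecting onto `basis`."""
    return (data @ basis).var(axis=0).sum() / data.var(axis=0).sum()


pca_fraction = retained_variance(X, pca_basis)
random_fraction = retained_variance(X, random_basis)
print(f"PCA keeps {pca_fraction:.2f}, random projection keeps {random_fraction:.2f}")
```

Among orthogonal projections onto two dimensions, the PCA basis keeps the most variance, so the random subspace can never do better on this measure.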


Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.

Examples

See Manifold learning on handwritten digits: Locally Linear Embedding, Isomap… for an example of dimensionality reduction on handwritten digits.

See Comparison of Manifold Learning methods for an example of dimensionality reduction on a toy “S-curve” dataset.

The manifold learning implementations available in scikit-learn are summarized below.

## 2.2.2. Isomap

One of the earliest approaches to manifold learning is the Isomap algorithm, short for Isometric Mapping. Isomap can be viewed as an extension of Multi-dimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points. Isomap can be performed with the object Isomap.

## Complexity

The Isomap algorithm comprises three stages:

1. **Nearest neighbor search.** Isomap uses BallTree for efficient neighbor search. The cost is approximately \(O[D \log(k) N \log(N)]\), for \(k\) nearest neighbors of \(N\) points in \(D\) dimensions.

2. **Shortest-path graph search.** The most efficient known algorithms for this are *Dijkstra’s Algorithm*, which is approximately \(O[N^2(k + \log(N))]\), or the *Floyd-Warshall algorithm*, which is \(O[N^3]\). The algorithm can be selected by the user with the `path_method` keyword of `Isomap`. If unspecified, the code attempts to choose the best algorithm for the input data.

3. **Partial eigenvalue decomposition.** The embedding is encoded in the eigenvectors corresponding to the \(d\) largest eigenvalues of the \(N \times N\) isomap kernel. For a dense solver, the cost is approximately \(O[d N^2]\). This cost can often be improved using the `ARPACK` solver. The eigensolver can be specified by the user with the `eigen_solver` keyword of `Isomap`. If unspecified, the code attempts to choose the best algorithm for the input data.

The overall complexity of Isomap is \(O[D \log(k) N \log(N)] + O[N^2(k + \log(N))] + O[d N^2]\).

\(N\) : number of training data points

\(D\) : input dimension

\(k\) : number of nearest neighbors

\(d\) : output dimension
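A minimal usage sketch of the `Isomap` object, with the `path_method` and `eigen_solver` keywords from the stages above set explicitly. The S-curve toy data and the specific parameter values are choices made for this example, not recommendations.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Toy data: 500 points sampled from a 2-D sheet rolled into an "S" in 3-D.
X, position = make_s_curve(n_samples=500, random_state=0)

isomap = Isomap(
    n_neighbors=10,         # k in the complexity formulas
    n_components=2,         # d, the output dimension
    path_method="D",        # Dijkstra; "FW" is Floyd-Warshall, "auto" lets the code choose
    eigen_solver="arpack",  # "dense" and "auto" are the other options
)
embedding = isomap.fit_transform(X)
print(embedding.shape)
```

Leaving both keywords at `"auto"` is usually the simpler choice; they are spelled out here only to show where each stage is configured.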

References

“A global geometric framework for nonlinear dimensionality reduction” Tenenbaum, J.B.; De Silva, V.; & Langford, J.C. Science 290 (5500)

## 2.2.3. Locally Linear Embedding

Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding.

Locally linear embedding can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding.
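The two entry points can be sketched side by side as follows. The S-curve data and the values of `n_neighbors` and `n_components` are illustrative assumptions.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding, locally_linear_embedding

X, _ = make_s_curve(n_samples=500, random_state=0)

# Object-oriented interface: the fitted estimator keeps the reconstruction error.
lle = LocallyLinearEmbedding(
    n_neighbors=12, n_components=2, method="standard", random_state=0
)
embedding = lle.fit_transform(X)
print(embedding.shape, lle.reconstruction_error_)

# Function interface: returns the embedding together with the squared error.
embedding_fn, squared_error = locally_linear_embedding(
    X, n_neighbors=12, n_components=2, random_state=0
)
print(embedding_fn.shape, squared_error)
```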

## Complexity

The standard LLE algorithm comprises three stages:

1. **Nearest Neighbors Search**. See discussion under Isomap above.

2. **Weight Matrix Construction**. \(O[D N k^3]\). The construction of the LLE weight matrix involves the solution of a \(k \times k\) linear equation for each of the \(N\) local neighborhoods.

3. **Partial Eigenvalue Decomposition**. See discussion under Isomap above.

The overall complexity of standard LLE is \(O[D \log(k) N \log(N)] + O[D N k^3] + O[d N^2]\).

\(N\) : number of training data points

\(D\) : input dimension

\(k\) : number of nearest neighbors

\(d\) : output dimension

References

“Nonlinear dimensionality reduction by locally linear embedding” Roweis, S. & Saul, L. Science 290:2323 (2000)

## 2.2.4. Modified Locally Linear Embedding

One well-known issue with LLE is the regularization problem. When the number of neighbors is greater than the number of input dimensions, the matrix defining each local neighborhood is rank-deficient. To address this, standard LLE applies an arbitrary regularization parameter \(r\), which is chosen relative to the trace of the local weight matrix. Though it can be shown formally that as \(r \to 0\), the solution converges to the desired embedding, there is no guarantee that the optimal solution will be found for \(r > 0\). This problem manifests itself in embeddings which distort the underlying geometry of the manifold.

One method to address the regularization problem is to use multiple weight vectors in each neighborhood. This is the essence of *modified locally linear embedding* (MLLE). MLLE can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword `method = 'modified'`. It requires `n_neighbors > n_components`.
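A short sketch of selecting MLLE through the `method` keyword, with the neighbor requirement checked up front. The data and parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=500, random_state=0)

n_components = 2
n_neighbors = 12
# MLLE needs more neighbors than output dimensions.
assert n_neighbors > n_components

mlle = LocallyLinearEmbedding(
    n_neighbors=n_neighbors,
    n_components=n_components,
    method="modified",
    random_state=0,
)
embedding_mlle = mlle.fit_transform(X)
print(embedding_mlle.shape)
```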

## Complexity

The MLLE algorithm comprises three stages:

1. **Nearest Neighbors Search**. Same as standard LLE.

2. **Weight Matrix Construction**. Approximately \(O[D N k^3] + O[N (k-D) k^2]\). The first term is exactly equivalent to that of standard LLE. The second term has to do with constructing the weight matrix from multiple weights. In practice, the added cost of constructing the MLLE weight matrix is relatively small compared to the cost of stages 1 and 3.

3. **Partial Eigenvalue Decomposition**. Same as standard LLE.

The overall complexity of MLLE is \(O[D \log(k) N \log(N)] + O[D N k^3] + O[N (k-D) k^2] + O[d N^2]\).

\(N\) : number of training data points

\(D\) : input dimension

\(k\) : number of nearest neighbors

\(d\) : output dimension

References

“MLLE: Modified Locally Linear Embedding Using Multiple Weights” Zhang, Z. & Wang, J.

## 2.2.5. Hessian Eigenmapping

Hessian Eigenmapping (also known as Hessian-based LLE: HLLE) is another method of solving the regularization problem of LLE. It revolves around a Hessian-based quadratic form at each neighborhood which is used to recover the locally linear structure. Though other implementations note its poor scaling with data size, `sklearn` implements some algorithmic improvements which make its cost comparable to that of other LLE variants for small output dimension. HLLE can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword `method = 'hessian'`. It requires `n_neighbors > n_components * (n_components + 3) / 2`.
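The neighbor requirement is easy to get wrong, so the sketch below computes the smallest admissible `n_neighbors` before fitting. The helper `min_hessian_neighbors` is hypothetical (it is not part of scikit-learn), and the dense eigensolver is chosen here only to keep the small example deterministic.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding


def min_hessian_neighbors(n_components):
    """Smallest integer n_neighbors with
    n_neighbors > n_components * (n_components + 3) / 2."""
    # n * (n + 3) is always even, so the integer division is exact.
    return n_components * (n_components + 3) // 2 + 1


X, _ = make_s_curve(n_samples=500, random_state=0)

n_components = 2
n_neighbors = max(12, min_hessian_neighbors(n_components))

hlle = LocallyLinearEmbedding(
    n_neighbors=n_neighbors,
    n_components=n_components,
    method="hessian",
    eigen_solver="dense",
)
embedding_hlle = hlle.fit_transform(X)
print(min_hessian_neighbors(n_components), embedding_hlle.shape)
```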

## Complexity

The HLLE algorithm comprises three stages:

1. **Nearest Neighbors Search**. Same as standard LLE.

2. **Weight Matrix Construction**. Approximately \(O[D N k^3] + O[N d^6]\). The first term reflects a similar cost to that of standard LLE. The second term comes from a QR decomposition of the local Hessian estimator.

3. **Partial Eigenvalue Decomposition**. Same as standard LLE.

The overall complexity of standard HLLE is \(O[D \log(k) N \log(N)] + O[D N k^3] + O[N d^6] + O[d N^2]\).

\(N\) : number of training data points

\(D\) : input dimension

\(k\) : number of nearest neighbors

\(d\) : output dimension

References

“Hessian Eigenmaps: Locally linear embedding techniques for high-dimensional data” Donoho, D. & Grimes, C. Proc Natl Acad Sci USA. 100:5591 (2003)

## 2.2.6. Spectral Embedding

Spectral Embedding is an approach to calculating a non-linear embedding. Scikit-learn implements Laplacian Eigenmaps, which finds a low dimensional representation of the data using a spectral decomposition of the graph Laplacian. The graph generated can be considered as a discrete approximation of the low dimensional manifold in the high dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low dimensional space, preserving local distances. Spectral embedding can be performed with the function spectral_embedding or its object-oriented counterpart SpectralEmbedding.
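A minimal usage sketch of the `SpectralEmbedding` object. The nearest-neighbor affinity, the S-curve data and the parameter values are assumptions chosen for this example.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import SpectralEmbedding

X, _ = make_s_curve(n_samples=500, random_state=0)

spectral = SpectralEmbedding(
    n_components=2,
    affinity="nearest_neighbors",  # build the graph from a k-nearest-neighbor search
    n_neighbors=10,
    random_state=0,
)
embedding_se = spectral.fit_transform(X)
print(embedding_se.shape)
```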

## Complexity

The Spectral Embedding (Laplacian Eigenmaps) algorithm comprises three stages:

1. **Weighted Graph Construction**. Transform the raw input data into a graph representation using an affinity (adjacency) matrix.

2. **Graph Laplacian Construction**. The unnormalized graph Laplacian is constructed as \(L = D - A\) and the normalized one as \(L = D^{-\frac{1}{2}} (D - A) D^{-\frac{1}{2}}\). In these two formulas \(A\) is the affinity matrix and \(D\) is its diagonal degree matrix, not the input dimension.

3. **Partial Eigenvalue Decomposition**. Eigenvalue decomposition is done on the graph Laplacian.
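The Laplacian construction in the second stage can be sketched directly with NumPy. The four-node path graph and its affinity matrix are hypothetical, and the final eigendecomposition is a simplified stand-in for the partial solver a library would use.

```python
import numpy as np

# Hypothetical symmetric affinity matrix A for a 4-node path graph 0-1-2-3.
A = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

degrees = A.sum(axis=1)
D = np.diag(degrees)  # degree matrix (not the input dimension)

L_unnormalized = D - A
D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
L_normalized = D_inv_sqrt @ (D - A) @ D_inv_sqrt

# eigh returns eigenvalues in ascending order. For a connected graph the first
# one is 0; the eigenvectors that follow carry the embedding information.
eigenvalues, eigenvectors = np.linalg.eigh(L_normalized)
first_nontrivial = eigenvectors[:, 1]
print(np.round(eigenvalues, 3))
```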

The overall complexity of spectral embedding is \(O[D \log(k) N \log(N)] + O[D N k^3] + O[d N^2]\).

\(N\) : number of training data points

\(D\) : input dimension

\(k\) : number of nearest neighbors

\(d\) : output dimension

References

“Laplacian Eigenmaps for Dimensionality Reduction and Data Representation” M. Belkin, P. Niyogi, Neural Computation, June 2003; 15 (6):1373-1396

## 2.2.7. Local Tangent Space Alignment

Though not technically a variant of LLE, Local tangent space alignment (LTSA) is algorithmically similar enough to LLE that it can be put in this category. Rather than focusing on preserving neighborhood distances as in LLE, LTSA seeks to characterize the local geometry at each neighborhood via its tangent space, and performs a global optimization to align these local tangent spaces to learn the embedding. LTSA can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword `method = 'ltsa'`.
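To make the idea of a local tangent space concrete, the sketch below estimates one at a single point from the SVD of its centered neighborhood. It covers only the local step, not the global alignment, and is not scikit-learn's implementation. The helper `local_tangent_basis` and the flat toy data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption): 200 points on the flat plane z = 0 inside 3-D space.
X = np.column_stack(
    [rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200), np.zeros(200)]
)


def local_tangent_basis(X, index, n_neighbors, n_components):
    """Estimate an orthonormal basis of the tangent space at X[index]."""
    distances = np.linalg.norm(X - X[index], axis=1)
    neighbors = np.argsort(distances)[:n_neighbors]  # includes the point itself
    centered = X[neighbors] - X[neighbors].mean(axis=0)
    # The leading right singular vectors span the directions of largest local spread.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]


basis = local_tangent_basis(X, index=0, n_neighbors=10, n_components=2)
print(np.round(basis, 3))
```

Because the toy manifold is a plane, both basis vectors come out with a zero third coordinate; on a curved manifold the basis would change from one neighborhood to the next, which is what the alignment step reconciles.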

## Complexity

The LTSA algorithm comprises three stages:

1. **Nearest Neighbors Search**. Same as standard LLE.

2. **Weight Matrix Construction**. Approximately \(O[D N k^3] + O[k^2 d]\). The first term reflects a similar cost to that of standard LLE.

3. **Partial Eigenvalue Decomposition**. Same as standard LLE.

The overall complexity of standard LTSA is \(O[D \log(k) N \log(N)] + O[D N k^3] + O[k^2 d] + O[d N^2]\).

\(N\) : number of training data points

\(D\) : input dimension

\(k\) : number of nearest neighbors

\(d\) : output dimension

References

“Principal manifolds and nonlinear dimensionality reduction via tangent space alignment” Zhang, Z. & Zha, H. Journal of Shanghai Univ. 8:406 (2004)

## 2.2.8. Multi-dimensional Scaling (MDS)

Multidimensional scaling (MDS) seeks a low-dimensional representation of the data in which the distances closely match the distances in the original high-dimensional space.

In general, MDS is a technique used for analyzing similarity or dissimilarity data. It attempts to model similarity or dissimilarity data as distances in a geometric space. The data can be ratings of similarity between objects, interaction frequencies of molecules, or trade indices between countries.

There exist two types of MDS algorithm: metric and non-metric. In scikit-learn, the class MDS implements both. In metric MDS, the input similarity matrix arises from a metric (and thus respects the triangle inequality), and the distances between two output points are set to be as close as possible to the similarity or dissimilarity data. In the non-metric version, the algorithm tries to preserve the order of the distances, and hence seeks a monotonic relationship between the distances in the embedded space and the similarities/dissimilarities.

Let \(S\) be the similarity matrix, and \(X\) the coordinates of the \(n\) input points. Disparities \(\hat{d}_{ij}\) are transformations of the similarities chosen in some optimal way. The objective, called the stress, is then defined by \(\sum_{i < j} (d_{ij}(X) - \hat{d}_{ij}(X))^2\).

## Metric MDS

In the simplest metric MDS model, called *absolute MDS*, disparities are defined by \(\hat{d}_{ij} = S_{ij}\). With absolute MDS, the value \(S_{ij}\) should then correspond exactly to the distance between points \(i\) and \(j\) in the embedding space.

Most commonly, disparities are set to \(\hat{d}_{ij} = b S_{ij}\).

## Nonmetric MDS

Non-metric MDS focuses on the ordination of the data. If \(S_{ij} > S_{jk}\), then the embedding should enforce \(d_{ij} < d_{jk}\). For this reason, we discuss it in terms of dissimilarities (\(\delta_{ij}\)) instead of similarities (\(S_{ij}\)). Note that dissimilarities can easily be obtained from similarities through a simple transform, e.g. \(\delta_{ij}=c_1-c_2 S_{ij}\) for some real constants \(c_1, c_2\). A simple algorithm to enforce proper ordination is to use a monotonic regression of \(d_{ij}\) on \(\delta_{ij}\), yielding disparities \(\hat{d}_{ij}\) in the same order as \(\delta_{ij}\).
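The monotonic regression step can be sketched with scikit-learn's `IsotonicRegression`. The five dissimilarities and distances below are hypothetical values, and this shows a single regression step rather than the full iterative MDS procedure.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical values for five point pairs: dissimilarities delta_ij and the
# current embedding distances d_ij, which violate the desired ordering twice.
delta = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
d = np.array([0.9, 2.5, 2.1, 3.8, 3.6])

# Monotonic (isotonic) regression of d on delta gives the disparities.
disparities = IsotonicRegression().fit(delta, d).predict(delta)
print(disparities)
```

Each pair of out-of-order distances is replaced by its average, so the disparities never decrease as \(\delta_{ij}\) increases.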

A trivial solution to this problem is to set all the points on the origin. In order to avoid that, the disparities \(\hat{d}_{ij}\) are normalized. Note that since we only care about relative ordering, our objective should be invariant to simple translation and scaling; however, the stress used in metric MDS is sensitive to scaling. To address this, non-metric MDS may use a normalized stress, known as Stress-1, defined as

\[\sqrt{\frac{\sum_{i < j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i < j} d_{ij}^2}}.\]

The use of normalized Stress-1 can be enabled by setting `normalized_stress=True`; however, it is only compatible with the non-metric MDS problem and will be ignored in the metric case.
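A small NumPy sketch of why the normalization matters. The raw stress here is taken to be the sum of squared differences that forms the numerator of Stress-1. The helper names, the three-point embedding and the disparities are assumptions for illustration.

```python
import numpy as np


def pairwise_distances_upper(X):
    """Euclidean distances d_ij for all pairs i < j."""
    diffs = X[:, None, :] - X[None, :, :]
    full = np.sqrt((diffs ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(X), k=1)
    return full[i, j]


def raw_stress(d, d_hat):
    return np.sum((d - d_hat) ** 2)


def stress_1(d, d_hat):
    return np.sqrt(np.sum((d - d_hat) ** 2) / np.sum(d ** 2))


# Hypothetical 2-D embedding of three points and target disparities.
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
d = pairwise_distances_upper(X)  # pairs (0, 1), (0, 2), (1, 2)
d_hat = np.array([3.0, 4.0, 6.0])

print(raw_stress(d, d_hat), stress_1(d, d_hat))
# Rescaling distances and disparities together changes the raw stress
# but leaves Stress-1 unchanged.
print(raw_stress(2 * d, 2 * d_hat), stress_1(2 * d, 2 * d_hat))
```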

References

“Modern Multidimensional Scaling - Theory and Applications” Borg, I.; Groenen P. Springer Series in Statistics (1997)

“Nonmetric multidimensional scaling: a numerical method” Kruskal, J. Psychometrika, 29 (1964)

“Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis” Kruskal, J. Psychometrika, 29, (1964)

## 2.2.9. t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE (TSNE) converts affinities of data points to probabilities. The affinities in the original space are represented by Gaussian joint probabilities and the affinities in the embedded space are represented by Student’s t-distributions. This allows t-SNE to be particularly sensitive to local structure and has a few other advantages over existing techniques:

Revealing the structure at many scales on a single map

Revealing data that lie in multiple, different, manifolds or clusters

Reducing the tendency to crowd points together at the center

While Isomap, LLE and variants are best suited to unfold a single continuous low dimensional manifold, t-SNE will focus on the local structure of the data and will tend to extract clustered local groups of samples as highlighted on the S-curve example. This ability to group samples based on the local structure might be beneficial to visually disentangle a dataset that comprises several manifolds at once as is the case in the digits dataset.

The Kullback-Leibler (KL) divergence of the joint probabilities in the original space and the embedded space will be minimized by gradient descent. Note that the KL divergence is not convex, i.e. multiple restarts with different initializations will end up in local minima of the KL divergence. Hence, it is sometimes useful to try different seeds and select the embedding with the lowest KL divergence.
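The restart strategy can be sketched as a loop over seeds that keeps the run with the lowest final KL divergence. The blob data, the perplexity value and the choice of three seeds are assumptions made to keep the example small.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Toy data (an assumption): 150 points in 10-D drawn from three Gaussian blobs.
X, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

best = None
for seed in range(3):
    tsne = TSNE(n_components=2, perplexity=20, init="random", random_state=seed)
    embedding = tsne.fit_transform(X)
    # kl_divergence_ holds the value of the objective after optimization.
    if best is None or tsne.kl_divergence_ < best[0]:
        best = (tsne.kl_divergence_, seed, embedding)

best_kl, best_seed, best_embedding = best
print(f"lowest KL divergence {best_kl:.3f} from seed {best_seed}")
```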

The disadvantages to using t-SNE are roughly:

t-SNE is computationally expensive, and can take several hours on million-sample datasets where PCA will finish in seconds or minutes.

The Barnes-Hut t-SNE method is limited to two or three dimensional embeddings.

The algorithm is stochastic and multiple restarts with different seeds can yield different embeddings. However, it is perfectly legitimate to pick the embedding with the least error.

Global structure is not explicitly preserved. This problem is mitigated by initializing points with PCA (using `init='pca'`).

## Optimizing t-SNE

The main purpose of t-SNE is visualization of high-dimensional data. Hence, it works best when the data will be embedded in two or three dimensions.

Optimizing the KL divergence can be a little bit tricky sometimes. There are five parameters that control the optimization of t-SNE and therefore possibly the quality of the resulting embedding:

perplexity

early exaggeration factor

learning rate

maximum number of iterations

angle (not used in the exact method)

The perplexity is defined as \(k=2^{S}\) where \(S\) is the Shannon entropy of the conditional probability distribution. The perplexity of a \(k\)-sided die is \(k\), so that \(k\) is effectively the number of nearest neighbors t-SNE considers when generating the conditional probabilities. Larger perplexities lead to more nearest neighbors and less sensitivity to small structure. Conversely, a lower perplexity considers a smaller number of neighbors, and thus ignores more global information in favour of the local neighborhood. As dataset sizes get larger, more points will be required to get a reasonable sample of the local neighborhood, and hence larger perplexities may be required. Similarly, noisier datasets will require larger perplexity values to encompass enough local neighbors to see beyond the background noise.
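The die analogy can be checked with a few lines of plain Python. The function name `perplexity` and the loaded-die probabilities are assumptions for illustration; the entropy is measured in bits so that the base-2 exponent matches the definition above.

```python
import math


def perplexity(probabilities):
    """Return 2**S, where S is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


fair_die = [1 / 6] * 6
loaded_die = [0.9] + [0.02] * 5

print(perplexity(fair_die), perplexity(loaded_die))
```

A fair six-sided die has perplexity 6, while the loaded die, which nearly always shows the same face, behaves like a die with far fewer effective sides.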

The maximum number of iterations is usually high enough and does not need any tuning. The optimization consists of two phases: the early exaggeration phase and the final optimization. During early exaggeration the joint probabilities in the original space will be artificially increased by multiplication with a given factor. Larger factors result in larger gaps between natural clusters in the data. If the factor is too high, the KL divergence could increase during this phase. Usually it does not have to be tuned. A critical parameter is the learning rate. If it is too low, gradient descent will get stuck in a bad local minimum. If it is too high, the KL divergence will increase during optimization. A heuristic suggested in Belkina et al. (2019) is to set the learning rate to the sample size divided by the early exaggeration factor. We implement this heuristic as the `learning_rate='auto'` argument. More tips can be found in Laurens van der Maaten’s FAQ (see references). The last parameter, angle, is a tradeoff between performance and accuracy. Larger angles imply that we can approximate larger regions by a single point, leading to better speed but less accurate results.

“How to Use t-SNE Effectively” provides a good discussion of the effects of the various parameters, as well as interactive plots to explore the effects of different parameters.

## Barnes-Hut t-SNE

The Barnes-Hut t-SNE that has been implemented here is usually much slower than other manifold learning algorithms. The optimization is quite difficult and the computation of the gradient is \(O[d N \log(N)]\), where \(d\) is the number of output dimensions and \(N\) is the number of samples. The Barnes-Hut method improves on the exact method where t-SNE complexity is \(O[d N^2]\), but has several other notable differences:

The Barnes-Hut implementation only works when the target dimensionality is 3 or less. The 2D case is typical when building visualizations.

Barnes-Hut only works with dense input data. Sparse data matrices can only be embedded with the exact method or can be approximated by a dense low rank projection, for instance using PCA.

Barnes-Hut is an approximation of the exact method. The approximation is parameterized with the angle parameter, therefore the angle parameter is unused when `method="exact"`.

Barnes-Hut is significantly more scalable. Barnes-Hut can be used to embed hundreds of thousands of data points while the exact method can handle thousands of samples before becoming computationally intractable.
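The differences listed above show up directly in how the two methods are configured. The sketch below runs both on a small toy dataset; the blob data, the perplexity and the choice of four output dimensions for the exact method are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=200, n_features=10, centers=4, random_state=0)

# Barnes-Hut: approximate gradient controlled by `angle`, at most 3 output dimensions.
tsne_bh = TSNE(
    n_components=2, method="barnes_hut", angle=0.5, perplexity=30, random_state=0
)
embedding_bh = tsne_bh.fit_transform(X)

# Exact: `angle` is unused, and more than 3 output dimensions are allowed.
tsne_exact = TSNE(n_components=4, method="exact", perplexity=30, random_state=0)
embedding_exact = tsne_exact.fit_transform(X)

print(embedding_bh.shape, embedding_exact.shape)
```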

For visualization purposes (which is the main use case of t-SNE), using the Barnes-Hut method is strongly recommended. The exact t-SNE method is useful for checking the theoretical properties of the embedding, possibly in higher dimensional space, but is limited to small datasets due to computational constraints.

Also note that the digits labels roughly match the natural grouping found by t-SNE while the linear 2D projection of the PCA model yields a representation where label regions largely overlap. This is a strong clue that this data can be well separated by non-linear methods that focus on the local structure (e.g. an SVM with a Gaussian RBF kernel). However, failing to visualize well separated homogeneously labeled groups with t-SNE in 2D does not necessarily imply that the data cannot be correctly classified by a supervised model. It might be the case that 2 dimensions are not enough to accurately represent the internal structure of the data.

References

“Visualizing High-Dimensional Data Using t-SNE” van der Maaten, L.J.P.; Hinton, G. Journal of Machine Learning Research (2008)

“t-Distributed Stochastic Neighbor Embedding” van der Maaten, L.J.P.

“Accelerating t-SNE using Tree-Based Algorithms” van der Maaten, L.J.P.; Journal of Machine Learning Research 15(Oct):3221-3245, 2014.

“Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets” Belkina, A.C., Ciccolella, C.O., Anno, R., Halpert, R., Spidlen, J., Snyder-Cappione, J.E., Nature Communications 10, 5415 (2019).

## 2.2.10. Tips on practical use

Make sure the same scale is used over all features. Because manifold learning methods are based on a nearest-neighbor search, the algorithm may perform poorly otherwise. See StandardScaler for convenient ways of scaling heterogeneous data.
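One way to apply the scaling tip is to put `StandardScaler` in front of the manifold learner in a pipeline. The artificially rescaled feature and the choice of Isomap as the second step are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_s_curve(n_samples=500, random_state=0)
# Hypothetical unit mismatch: blow up the second feature by a factor of 1000,
# so that it would dominate any nearest-neighbor search on the raw data.
X_mixed = X * np.array([1.0, 1000.0, 1.0])

pipeline = make_pipeline(StandardScaler(), Isomap(n_neighbors=10, n_components=2))
embedding_scaled = pipeline.fit_transform(X_mixed)

# After scaling, every feature has unit standard deviation.
X_scaled = pipeline.named_steps["standardscaler"].transform(X_mixed)
print(X_scaled.std(axis=0))
```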

The reconstruction error computed by each routine can be used to choose the optimal output dimension. For a \(d\)-dimensional manifold embedded in a \(D\)-dimensional parameter space, the reconstruction error will decrease as `n_components` is increased until `n_components == d`.

Note that noisy data can “short-circuit” the manifold, in essence acting as a bridge between parts of the manifold that would otherwise be well-separated. Manifold learning on noisy and/or incomplete data is an active area of research.
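The tip about using the reconstruction error to choose the output dimension can be sketched with Isomap, which exposes the error through its `reconstruction_error()` method. The S-curve data, the candidate dimensions and the dense eigensolver are assumptions for illustration.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# The S-curve is a 2-D sheet embedded in 3-D, so d = 2 and D = 3 here.
X, _ = make_s_curve(n_samples=500, random_state=0)

errors = {}
for n_components in (1, 2, 3):
    isomap = Isomap(n_neighbors=10, n_components=n_components, eigen_solver="dense")
    isomap.fit(X)
    errors[n_components] = isomap.reconstruction_error()

for n_components, error in errors.items():
    print(n_components, error)
```

In this sketch one would look for the point where adding another component stops reducing the error noticeably.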

Certain input configurations can lead to singular weight matrices, for example when more than two points in the dataset are identical, or when the data is split into disjointed groups. In this case, `eigen_solver='arpack'` will fail to find the null space. The easiest way to address this is to use `eigen_solver='dense'`, which will work on a singular matrix, though it may be very slow depending on the number of input points. Alternatively, one can attempt to understand the source of the singularity: if it is due to disjoint sets, increasing `n_neighbors` may help. If it is due to identical points in the dataset, removing these points may help.
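Both remedies can be combined in a few lines. The sketch below drops identical rows with `np.unique` and then requests the dense eigensolver explicitly through the `eigen_solver` parameter of `LocallyLinearEmbedding`. The duplicated toy data is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=300, random_state=0)
# Hypothetical problem input: the first 20 points appear three times each.
X_with_duplicates = np.vstack([X, X[:20], X[:20]])

# Remove identical rows before fitting. Note that np.unique also sorts the
# rows, so the original sample order is not preserved.
X_unique = np.unique(X_with_duplicates, axis=0)

# The dense eigensolver works on a singular matrix, at a higher cost for large N.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, eigen_solver="dense")
embedding_unique = lle.fit_transform(X_unique)
print(X_with_duplicates.shape, X_unique.shape, embedding_unique.shape)
```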

See also

Totally Random Trees Embedding can also be useful to derive non-linear representations of feature space, although it does not perform dimensionality reduction.