Visual Representation of Topic Clusters (Part 2)

This is a sequel to the article titled “Visual Representation of Topic Clusters (Part 1)”. Here we will look at the PCoA and t-SNE representations of topic models using LDAvis.

PCoA, or principal coordinate analysis, is also known as classical multidimensional scaling. The visual above represents PCoA on an NMF (non-negative matrix factorization) model with 9 topics. It shows global clusters, which represent the marginal distribution of the terms across the topics in the entire corpus. The right panel is a bar chart of the top 30 most salient terms in the whole corpus.
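
To make "classical multidimensional scaling" a bit more concrete, here is a minimal, dependency-free sketch of PCoA. It is not the code LDAvis runs; it only illustrates the idea of turning a matrix of pairwise distances into 2D coordinates. The function name pcoa, the power-iteration eigen-solver and the four rectangle points are all assumptions made for this illustration, and the sketch assumes a symmetric distance matrix that can be embedded in Euclidean space.

```python
import math
import random


def pcoa(dist, n_components=2, n_iter=500, seed=0):
    """Classical MDS (PCoA) of a symmetric distance matrix, in pure Python."""
    n = len(dist)
    # Square the distances, then double-center: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]  # equals column means (symmetry)
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)  # fixed seed keeps the start vectors repeatable
    coords = [[0.0] * n_components for _ in range(n)]
    for k in range(n_components):
        # Power iteration converges to the eigenvector whose eigenvalue has
        # the largest magnitude (assumed positive for Euclidean distances).
        v = [rng.random() - 0.5 for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Made-up example: four corners of a 4-by-1 rectangle (not the article's topics).
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = pcoa(dist)
```

Because the input distances here come from genuinely two-dimensional points, the recovered coordinates reproduce every pairwise distance (up to rotation and reflection). With real topic distances the 2D map is only an approximation, but it is one that tries to respect all distances at once, which is why the layout is described as global.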

Observe topic 1, to which 23.9% of the tokens in the corpus belong. Clustered around the selected topic are topics 5, 4, 3 and 2. The bigger a circle, the larger the share of the corpus's tokens that belong to that topic, and the greater the distance between two circles, the less similar the terms of those topics. In other words, topic 9 is more distinct in theme from topic 1 than topic 5 is. On the other hand, although the selected topic is similar in theme to topic 2, it shares far fewer terms with topic 2 than it does with topic 5.
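
The distance between circles has a concrete meaning: by default, LDAvis bases the inter-topic distance on the Jensen-Shannon divergence between the topics' term distributions. Below is a small sketch of that measure. The three topic distributions over a four-term vocabulary are made-up numbers for illustration, not values from the model above, and the function names are my own.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-term distributions over the same 4-term vocabulary.
topic_a = [0.5, 0.3, 0.1, 0.1]
topic_b = [0.4, 0.4, 0.1, 0.1]  # puts weight on the same terms as topic_a
topic_c = [0.0, 0.1, 0.4, 0.5]  # puts weight on different terms

near = js_divergence(topic_a, topic_b)
far = js_divergence(topic_a, topic_c)
```

Here near comes out much smaller than far, which is the numeric counterpart of two circles sitting close together versus far apart on the map.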

In the t-SNE representation, topic 2 is selected for observation. As observed in the PCoA view, it is similar in meaning to topic 1 from the previous model; it represents 20.2% of the corpus and is the second most dominant topic in the entire corpus. As in the visualization from the previous article, the distribution is skewed, and the term ‘neural’ is the most relevant term when lambda = 1. This topic was selected so that a topic with the same theme could be compared in both topic models: the overall ranking of terms, and the change in rankings as lambda changes, is similar in both representations. The distance between topics in this case is local, which means that, amongst themselves, the topics are dissimilar when compared with one another. This gives a clear picture of the variety of sub-themes in the corpus.
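
The lambda slider controls the relevance metric defined by the authors of LDAvis: relevance = lambda * log p(term | topic) + (1 - lambda) * log(p(term | topic) / p(term)). The sketch below shows how the ranking shifts between the two extremes. The probabilities are made-up numbers chosen for illustration (they are not taken from the model discussed here), and relevance and rank_terms are hypothetical helper names.

```python
import math


def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term to a topic for a given lambda in [0, 1]."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


def rank_terms(topic_probs, corpus_probs, lam):
    """Return the topic's terms ordered from most to least relevant."""
    scores = {
        term: relevance(p, corpus_probs[term], lam)
        for term, p in topic_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical p(term | topic) and p(term) values.
topic_probs = {"neural": 0.20, "network": 0.15, "backpropagation": 0.05}
corpus_probs = {"neural": 0.10, "network": 0.10, "backpropagation": 0.012}

by_probability = rank_terms(topic_probs, corpus_probs, lam=1.0)
by_lift = rank_terms(topic_probs, corpus_probs, lam=0.0)
```

At lambda = 1 the terms are ranked purely by their probability within the topic, so the frequent term ‘neural’ comes first in by_probability. At lambda = 0 they are ranked by lift, so the rarer but more topic-specific term moves to the top of by_lift.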

Compared with the PCoA analysis, the topics look similar with respect to the overall corpus, which suggests that the model would have a high coherence score. This means that, overall, the documents in the corpus come from a similar background or subject.

Both t-SNE and PCoA are unsupervised dimensionality reduction techniques; however, PCoA is linear and t-SNE is non-linear. While PCoA preserves the global structure of the data, t-SNE preserves the local structure and is a randomized algorithm, so its layout can change from run to run unless the random seed is fixed.


