How to use the Top2Vec model for topic modeling?
Natural language processing (NLP) includes several kinds of modeling that allow applications to work with human language, and these models have become important in many fields. Topic modeling is one of them: it extracts topics from a collection of documents, and it has been the subject of a good deal of research. Top2Vec is one algorithm for topic modeling. In this article, we discuss topic modeling in general and the Top2Vec algorithm in particular. The main points to be discussed are listed below.
- About topic modeling
- Algorithms for topic modeling
- What is Top2Vec?
- Topic modeling with Top2Vec
  - Step 1: Generation of document and word embedding vectors
  - Step 2: Perform dimension reduction of the embedding vectors
  - Step 3: Perform clustering on reduced vectors
  - Step 4: Calculation of the cluster centroids
  - Step 5: Assignment of words to topics
Let’s start by understanding what topic modeling is.
About topic modeling
Topic modeling is a natural language processing technique for discovering the semantic structure hidden in text documents. It is a form of statistical modeling used to discover the abstract topics that occur in textual data. For example, if an article frequently uses terms like data science and data analytics, the article is probably about data science.
An article may also cover more than one topic. If it is 60% about data science and 40% about cloud services, we would expect it to contain about 1.5 times as many data science words as cloud services words.
Algorithmically, we can view this process as clustering, where the model groups similar words together. Like other NLP models, it relies on a mathematical framework to capture the intuition behind the documents: the algorithm examines word statistics across the documents and uses them to work out the topics.
We can also consider topic modeling a type of probabilistic modeling, because probability is used to discover the latent semantic structures of a document. In most projects, it serves as a text mining tool.
The image above represents the discovery process using a document-word matrix. In the image, the columns represent documents and the rows represent words.
Each cell of the matrix stores the frequency of a word in a document, and the intensity of the color represents that frequency. Using topic modeling, we can group documents that use similar words, and words that occur in a similar set of documents. The resulting groups represent the topics.
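A matrix of this kind can be sketched in a few lines of Python. This is only an illustration: the three-document corpus is a made-up example, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

# Hypothetical toy corpus; any list of strings would work.
documents = [
    "data science uses data analytics",
    "cloud services host data",
    "cloud services scale cloud applications",
]

# Count how often each word occurs in each document.
counts = [Counter(doc.split()) for doc in documents]
vocabulary = sorted(set(word for c in counts for word in c))

# Rows are words, columns are documents, cells are frequencies.
matrix = {word: [c[word] for c in counts] for word in vocabulary}

for word, row in matrix.items():
    print(f"{word:<12} {row}")
```

Here the row for "data" is [2, 1, 0] and the row for "cloud" is [0, 1, 2], so the first and last documents share no vocabulary while the middle one overlaps with both.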
Algorithms for topic modeling
Topic modeling has been an active research area since 1998, when an early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala. Probabilistic latent semantic analysis (PLSA) was introduced by Thomas Hofmann in 1999. LDA (Latent Dirichlet Allocation) is the most widely used algorithm for topic modeling.
Some techniques are based on SVD (singular value decomposition), and others on non-negative matrix factorization. More recently, graph-based approaches built on the stochastic block model have also appeared.
In this article, we discuss one such technique, Top2Vec, which has shown promising results in topic modeling and uses embedding vectors and clustering to do its work. Let’s introduce the Top2Vec model.
What is Top2Vec?
Top2Vec is an algorithm that performs topic modeling in a very simple way. Although it is not itself a transformer, it can use a pretrained transformer model to produce its embeddings. It is not limited to topic modeling; it can also be used for semantic search over documents. It automatically detects the topics present in a collection of text documents and generates jointly embedded topic, document and word vectors.
Below are the main things we can do with Top2Vec:
- Get the number of detected topics
- Get topic contents and sizes
- Find the hierarchy in topics
- Use keywords to search for topics
- Use topics to search for documents
- Use keywords to search for documents
- Find similar words
- Find similar documents
A key feature of this algorithm is that it finds the number of topics automatically, and it works on both long and short text. We can install it using the following line of code.
!pip install top2vec
Its implementation is available on GitHub. In this article, we will see how it works.
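The tasks listed above map onto methods of the library's Top2Vec class. The following is a minimal sketch, not a definitive recipe: the function name explore_topics and the parameter values are assumptions made for illustration, and training generally needs a corpus far larger than a handful of documents.

```python
def explore_topics(documents, keyword):
    """Train Top2Vec on a list of strings and print a short summary.

    `keyword` must be a word the model has learned, otherwise the
    keyword searches raise an error.
    """
    # Imported lazily so this function can be defined without the package.
    from top2vec import Top2Vec

    # Parameter values here are illustrative choices, not recommendations.
    model = Top2Vec(documents, embedding_model="doc2vec", speed="learn", workers=4)

    # Number of detected topics, and how many documents fall in each.
    num_topics = model.get_num_topics()
    topic_sizes, topic_nums = model.get_topic_sizes()
    print(num_topics, list(zip(topic_nums, topic_sizes)))

    # Top words for every topic.
    topic_words, word_scores, topic_nums = model.get_topics(num_topics)
    for num, words in zip(topic_nums, topic_words):
        print(num, list(words[:5]))

    # Words similar to the keyword, and documents that match it.
    words, scores = model.similar_words(keywords=[keyword], num_words=10)
    print(list(words))
    docs, doc_scores, doc_ids = model.search_documents_by_keywords(
        keywords=[keyword], num_docs=5
    )
    print(list(doc_ids))

    return model
```

A call such as explore_topics(my_documents, "data") would train the model once and return it, so the same object can be reused for further searches.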
Topic modeling with Top2Vec
We have seen what can be done with Top2Vec. To perform these tasks, the algorithm goes through the following steps:
- Generation of document and word embedding vectors
- Dimension reduction of the embedding vectors
- Clustering of the reduced vectors
- Calculation of the cluster centroids
- Assignment of words to topics
Let’s explain all the steps one by one.
Step 1: Generation of document and word embedding vectors
This step generates embedding vectors, which let us represent text documents and words in a mathematical framework. The embedding space is multidimensional, and its dimension depends on the embedding model used. The embeddings can be produced with Doc2Vec, Universal Sentence Encoder or a BERT Sentence Transformer.
The image above represents a general word vector using one-hot word encoding.
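For comparison, a one-hot word encoding can be sketched as follows. The four-word vocabulary is a made-up example. Every pair of distinct one-hot vectors has a dot product of zero, so this encoding carries no notion of similarity between words, which is the gap that learned embeddings such as Doc2Vec are meant to fill.

```python
# Hypothetical vocabulary, used only for illustration.
vocabulary = ["analytics", "cloud", "data", "science"]


def one_hot(word, vocabulary):
    """Return a vector with 1 at the word's position and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocabulary]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


data_vec = one_hot("data", vocabulary)
science_vec = one_hot("science", vocabulary)

print(data_vec)                    # [0, 0, 1, 0]
print(dot(data_vec, science_vec))  # 0
```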
Step 2: Perform dimension reduction of the embedding vectors
In this step, the high-dimensional document vectors generated in step 1 are reduced to a lower dimension. Top2Vec uses the UMAP dimension reduction technique for this, which allows the next step to find dense areas for clustering.
The image above shows the word vectors of the documents; we can see that they form dense areas and can be separated into groups.
Step 3: Perform clustering on reduced vectors
This step divides the dimensionally reduced vectors into groups using the HDBSCAN clustering technique. It also gives us an estimate of the number of topics in the documents.
The image above represents step 3, where colors are used to distinguish vectors from different groups.
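Steps 2 and 3 can be reproduced outside Top2Vec with the umap-learn and hdbscan packages. The sketch below assumes both are installed; the function name reduce_and_cluster and all parameter values are illustrative assumptions rather than settings confirmed by this article.

```python
def reduce_and_cluster(document_vectors):
    """Reduce embeddings with UMAP, then cluster them with HDBSCAN.

    `document_vectors` is a 2-D array-like of shape (n_documents, n_dims).
    """
    # Imported lazily so the function can be defined without the packages.
    import umap  # provided by the umap-learn package
    import hdbscan

    # Step 2: project the vectors into a low-dimensional space.
    reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine")
    reduced = reducer.fit_transform(document_vectors)

    # Step 3: find dense areas; HDBSCAN labels noise points as -1.
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=15,
        metric="euclidean",
        cluster_selection_method="eom",
    )
    labels = clusterer.fit_predict(reduced)

    # The number of clusters, excluding noise, estimates the topic count.
    num_topics = len(set(labels)) - (1 if -1 in labels else 0)
    return reduced, labels, num_topics
```

Because HDBSCAN does not take the number of clusters as input, the topic count falls out of the clustering instead of being chosen in advance.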
Step 4: Calculation of the cluster centroids
This step can be considered the start of topic modeling proper: we calculate the centroid of each dense area found in step 3, and the resulting vector is the topic vector for that cluster.
In the image above we can see three types of dots. The red dots are sparse and far from the others, so they can be considered outliers. Using the blue dots, which are dense, the algorithm calculates the topic vector.
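The centroid calculation itself is a plain average. The sketch below uses made-up two-dimensional vectors and follows the HDBSCAN convention that the label -1 marks outliers; the function name topic_vectors is hypothetical, and which vectors are averaged in practice (reduced or original) is left open here.

```python
def topic_vectors(document_vectors, labels):
    """Average the vectors of each cluster; points labeled -1 are skipped."""
    groups = {}
    for vector, label in zip(document_vectors, labels):
        if label == -1:
            continue  # outlier, not part of any dense area
        groups.setdefault(label, []).append(vector)
    # zip(*vectors) iterates over the coordinates column by column.
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in groups.items()
    }


# Hypothetical data: two dense clusters and one outlier.
vectors = [[1.0, 1.0], [3.0, 1.0], [10.0, 10.0], [12.0, 14.0], [50.0, -50.0]]
labels = [0, 0, 1, 1, -1]

centroids = topic_vectors(vectors, labels)
print(centroids)  # {0: [2.0, 1.0], 1: [11.0, 12.0]}
```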
Step 5: Assignment of words to topics
This is the last step of Top2Vec: for each topic vector, it finds the n closest word vectors and assigns them to the topic as its topic words. The image below represents the final step.
Here we can see how Top2Vec finally gives us the result of topic modeling. Its implementation can be found on GitHub.
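Finding the n closest word vectors can be sketched as a ranking by similarity. Cosine similarity is assumed as the measure of closeness here, and the word vectors are made-up two-dimensional values; the function name topic_words is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def topic_words(topic_vector, word_vectors, n):
    """Return the n words whose vectors are most similar to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(topic_vector, word_vectors[word]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical word embeddings.
word_vectors = {
    "data": [0.9, 0.1],
    "science": [0.8, 0.3],
    "cloud": [0.1, 0.9],
    "server": [0.2, 0.8],
}

print(topic_words([1.0, 0.2], word_vectors, n=2))  # ['data', 'science']
```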
In this article, we have discussed topic modeling, which is part of natural language processing, and the Top2Vec algorithm, which we can use to perform it. The implementation mentioned above can be used to run Top2Vec.