Text clustering groups texts by their semantic content, so that documents with similar meaning end up in the same cluster.
To cluster documents, we follow these three steps:
-
Embedding documents:
First, we convert each document to an embedding (a fixed-length numeric vector) using an embedding model.
These embeddings are the features that we want to cluster.
We need to choose a model that was trained for semantic similarity tasks, so that documents with similar meaning map to nearby vectors.
Example:
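A minimal sketch of this step, assuming the sentence-transformers library and its all-MiniLM-L6-v2 model. Both are assumptions, since the text above does not name a library or a model, and the function names embed_documents and cosine_similarity are hypothetical helpers introduced for illustration.

```python
import math


def embed_documents(documents, model_name="all-MiniLM-L6-v2"):
    """Return one embedding vector per document.

    Assumption: the sentence-transformers package is installed and
    model_name refers to a model trained for semantic similarity.
    """
    # Imported lazily so cosine_similarity below works without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # encode() returns a 2-D array of shape (n_documents, embedding_dim).
    return model.encode(documents)


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Calling embed_documents(["a cat", "a kitten", "a car"]) should return three vectors. With a suitable model, the first two would be expected to have a higher cosine similarity with each other than with the third.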
-
Reducing the dimensionality of the documents' embeddings:
To make clustering a large volume of data easier, we can reduce the number of dimensions of the embeddings using a dimensionality reduction library.
Reducing the dimensions discards some information, so the resulting clusters might be less accurate.
Example:
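A minimal sketch of this step using UMAP from the umap-learn package. The parameter values (5 output dimensions, 15 neighbors, cosine metric, fixed random seed) are illustrative assumptions, not recommendations from the text, and safe_n_neighbors and reduce_embeddings are hypothetical helper names.

```python
def safe_n_neighbors(n_samples, requested=15):
    """Clamp n_neighbors to a value that suits a small dataset."""
    # n_neighbors must be at least 2 and should be smaller than
    # the number of samples.
    return max(2, min(requested, n_samples - 1))


def reduce_embeddings(embeddings, n_components=5, n_neighbors=15,
                      random_state=42):
    """Project embeddings down to n_components dimensions with UMAP.

    Assumption: the umap-learn package is installed and embeddings is a
    2-D array-like of shape (n_documents, embedding_dim).
    """
    import umap  # lazy import; provided by the umap-learn package

    reducer = umap.UMAP(
        n_components=n_components,
        n_neighbors=safe_n_neighbors(len(embeddings), n_neighbors),
        metric="cosine",            # a common choice for text embeddings
        random_state=random_state,  # fixed seed for repeatable results
    )
    # fit_transform() returns an array of shape (n_documents, n_components).
    return reducer.fit_transform(embeddings)
```

A smaller n_components keeps less of the original structure, which is the accuracy trade-off described above.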
See this page for more details about UMAP (Uniform Manifold Approximation and Projection) for dimension reduction:
https://umap-learn.readthedocs.io/en/latest/index.html
-
Clustering the reduced embeddings:
The last step is to use a clustering library to find groups of semantically similar documents.
Example:
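A minimal sketch of this step using the hdbscan package. The min_cluster_size value is an illustrative assumption, and cluster_embeddings and summarize_labels are hypothetical helper names. HDBSCAN assigns the label -1 to points it treats as noise, so the summary counts those separately.

```python
def cluster_embeddings(reduced_embeddings, min_cluster_size=5):
    """Return one cluster label per embedding (-1 marks noise).

    Assumption: the hdbscan and numpy packages are installed.
    """
    import hdbscan
    import numpy as np

    # Cast to float64 as a precaution; reduced embeddings are often float32.
    data = np.asarray(reduced_embeddings, dtype=np.float64)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(data)


def summarize_labels(labels):
    """Return (number of clusters, number of noise points)."""
    labels = [int(label) for label in labels]
    n_clusters = len({label for label in labels if label != -1})
    n_noise = labels.count(-1)
    return n_clusters, n_noise
```

Unlike k-means, HDBSCAN does not need the number of clusters in advance; min_cluster_size controls the smallest group that counts as a cluster.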
See this page for more details about the HDBSCAN clustering library:
https://hdbscan.readthedocs.io/en/latest/index.html
Example: Clustering a set of words
Required modules:
Python code:
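The three steps can be combined as in the sketch below. It is not the original code of this example: the word list, the model name (all-MiniLM-L6-v2), and every parameter value are assumptions chosen for illustration, and cluster_words and group_words are hypothetical helper names. It assumes the sentence-transformers, umap-learn, and hdbscan packages are installed.

```python
from collections import defaultdict

# Hypothetical input; any list with more words than n_neighbors works.
WORDS = [
    "cat", "dog", "horse", "rabbit",
    "apple", "banana", "orange", "grape",
    "car", "bus", "train", "bicycle",
]


def group_words(words, labels):
    """Map each cluster label to its words; label -1 holds noise words."""
    groups = defaultdict(list)
    for word, label in zip(words, labels):
        groups[int(label)].append(word)
    return dict(groups)


def cluster_words(words, min_cluster_size=3):
    """Embed, reduce, and cluster a list of words."""
    # Lazy imports so group_words can be used without these packages.
    import hdbscan
    import numpy as np
    import umap
    from sentence_transformers import SentenceTransformer

    # Step 1: embed each word.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(words)

    # Step 2: reduce to 2 dimensions, which also suits a scatter chart.
    reducer = umap.UMAP(n_components=2, n_neighbors=5, metric="cosine",
                        random_state=42)
    reduced = reducer.fit_transform(embeddings)

    # Step 3: cluster the reduced embeddings.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    labels = clusterer.fit_predict(np.asarray(reduced, dtype=np.float64))

    return group_words(words, labels)
```

Calling cluster_words(WORDS) returns a dictionary that maps each cluster label to its words. The exact grouping depends on the model and parameters, so no specific result is claimed here.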
Output:
Chart of the clusters: