LLMs | Text Clustering

  1. Text Clustering
    Text clustering groups texts by their semantic content, so that texts with similar meaning end up in the same cluster.


    To cluster documents we follow these three steps:

    • Embedding documents:
      First, we convert each document to an embedding using an embedding model. These embeddings are the features that we cluster, so the model should be one that was trained for semantic similarity tasks.

      Example:
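A minimal sketch of the embedding step, assuming the sentence-transformers package and the "all-MiniLM-L6-v2" model. Neither is named in this page, so treat both as illustrative choices. The helper name embed_documents and its model parameter are hypothetical; the parameter only exists so that another encoder can be swapped in.

```python
import numpy as np


def embed_documents(documents, model=None, model_name="all-MiniLM-L6-v2"):
    """Return an array with one embedding row per document.

    `model` can be any object with an encode(list_of_str) method.
    When it is omitted, a SentenceTransformer is loaded (assumed choice).
    """
    if model is None:
        # Imported lazily; requires the sentence-transformers package.
        from sentence_transformers import SentenceTransformer

        # The model is downloaded the first time it is used.
        model = SentenceTransformer(model_name)

    # encode() returns one vector per input document.
    return np.asarray(model.encode(documents))


# Typical call (downloads the model on first use):
# embeddings = embed_documents(["The cat sat on the mat.", "Stocks fell today."])
# embeddings.shape is (2, 384) with all-MiniLM-L6-v2.
```

Each row of the returned array is the feature vector for one document; the later steps only ever see these vectors, not the original text.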

    • Reducing the dimensionality of the document embeddings:
      To make clustering large volumes of data easier, we can reduce the number of dimensions of the embeddings using a dimensionality reduction library. This step can discard information, so the resulting clusters might be less accurate.

      Example:
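A minimal sketch of the reduction step with UMAP from the umap-learn package. The parameter values (5 output dimensions, cosine metric, min_dist=0.0, fixed random_state) are assumptions chosen for illustration, not settings taken from this page. The helper name reduce_embeddings and its reducer parameter are hypothetical.

```python
import numpy as np


def reduce_embeddings(embeddings, n_components=5, reducer=None):
    """Project embeddings down to a smaller number of dimensions.

    `reducer` can be any object with a fit_transform(X) method.
    When it is omitted, a UMAP reducer with `n_components` outputs is used;
    `n_components` is ignored when a reducer is passed in.
    """
    embeddings = np.asarray(embeddings)
    if reducer is None:
        # Imported lazily; requires the umap-learn package.
        from umap import UMAP

        reducer = UMAP(
            n_components=n_components,  # size of the reduced space
            min_dist=0.0,               # lets similar points pack tightly
            metric="cosine",            # common choice for text embeddings
            random_state=42,            # makes runs repeatable
        )

    # Learn the projection and apply it to the same data.
    return np.asarray(reducer.fit_transform(embeddings))


# Typical call on embeddings of shape (n_docs, 384):
# reduced = reduce_embeddings(embeddings)   # shape (n_docs, 5)
```

A smaller n_components makes clustering cheaper but discards more information, which is the accuracy trade-off described above.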

      See this page for more details about UMAP (Uniform Manifold Approximation and Projection) for dimension reduction:
      https://umap-learn.readthedocs.io/en/latest/index.html

    • Clustering the reduced embeddings:
      The last step is to use a clustering library to find groups of semantically similar documents.

      Example:
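A minimal sketch of the clustering step with the hdbscan package. The value min_cluster_size=5 and the euclidean metric are illustrative assumptions, and the helper names cluster_embeddings and cluster_sizes are hypothetical.

```python
from collections import Counter


def cluster_embeddings(reduced, min_cluster_size=5, clusterer=None):
    """Assign a cluster label to every row of `reduced`.

    `clusterer` can be any object with a fit_predict(X) method.
    When it is omitted, HDBSCAN is used; it labels outliers as -1.
    """
    if clusterer is None:
        # Imported lazily; requires the hdbscan package.
        from hdbscan import HDBSCAN

        clusterer = HDBSCAN(
            min_cluster_size=min_cluster_size,  # smallest group kept as a cluster
            metric="euclidean",
            cluster_selection_method="eom",
        )

    # One integer label per input row.
    return [int(label) for label in clusterer.fit_predict(reduced)]


def cluster_sizes(labels):
    """Count how many documents fall in each cluster (-1 is noise)."""
    return dict(Counter(labels))


# Typical call on the reduced embeddings:
# labels = cluster_embeddings(reduced)
# print(cluster_sizes(labels))
```

HDBSCAN does not need the number of clusters up front, and documents that fit no cluster get the label -1 instead of being forced into a group.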

      See this page for more details about the HDBSCAN clustering library:
      https://hdbscan.readthedocs.io/en/latest/index.html

    Example: Clustering a set of words

    Required modules:

    Python code:
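The original listing is not available here, so the following is only a sketch of how the three steps could be chained for a small set of words. It assumes the sentence-transformers, umap-learn and hdbscan packages. The word list, the model name "all-MiniLM-L6-v2", all parameter values, and the function names are illustrative assumptions.

```python
from collections import defaultdict

import numpy as np


def cluster_texts(texts, embedder, reducer, clusterer):
    """Embed, reduce, then cluster; returns one integer label per text."""
    embeddings = np.asarray(embedder.encode(texts))   # step 1: embeddings
    reduced = reducer.fit_transform(embeddings)       # step 2: fewer dimensions
    labels = clusterer.fit_predict(reduced)           # step 3: cluster labels
    return [int(label) for label in labels]


def group_by_cluster(texts, labels):
    """Map each cluster label to the texts assigned to it."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[label].append(text)
    return dict(groups)


def default_components():
    """Build the assumed components; imports are local to this function."""
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    # Small n_neighbors because the example data set is tiny.
    reducer = UMAP(n_components=2, n_neighbors=5, min_dist=0.0,
                   metric="cosine", random_state=42)
    clusterer = HDBSCAN(min_cluster_size=3)
    return embedder, reducer, clusterer


def main():
    words = [
        "cat", "dog", "horse", "lion", "tiger",
        "apple", "banana", "orange", "grape", "mango",
        "car", "truck", "bus", "train", "bicycle",
        "red", "blue", "green", "yellow", "purple",
    ]
    labels = cluster_texts(words, *default_components())
    for label, members in sorted(group_by_cluster(words, labels).items()):
        name = "noise" if label == -1 else f"cluster {label}"
        print(f"{name}: {members}")


# Call main() to run the example (downloads the model on first use).
```

The cluster numbers and their members depend on the model and on the UMAP and HDBSCAN settings, so results can differ between setups.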

    Output:

    Chart of the clusters:
    [Figure: Clusters Plot]
© 2025  mtitek