Text Clustering with Large Language Models | Word Clustering Example


Text Clustering
Text clustering is an unsupervised machine learning technique that groups documents or text snippets based on their semantic similarity. Unlike classification, clustering doesn't require labeled data; instead, it discovers hidden patterns and structures within text collections. This makes it valuable for exploratory data analysis, content organization, and understanding large text corpora.

Key Applications:
- Document organization and categorization
- Customer feedback analysis
- News article grouping
- Academic paper grouping by topic
- Social media content analysis
- Market research and trend detection
Example (to simplify, we use one-word sentences):
```
INPUT (unstructured textual data): ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
```
```
OUTPUT (clusters of semantically similar data):
Cluster 0 ['cats', 'dogs', 'elephants', 'birds'],
Cluster 1 ['cars', 'trains', 'planes']
```
To cluster documents we follow these three steps:
- Text Embedding Generation:
  The foundation of semantic clustering lies in converting text into numerical representations that capture meaning. Modern embedding models use transformer architectures trained on vast text corpora to understand semantic relationships.
  
  Popular Embedding Models:
  - all-MiniLM-L12-v2: Fast, efficient, good general performance
  - all-mpnet-base-v2: Higher quality, slightly slower
  - text-embedding-ada-002 (OpenAI): Commercial option with excellent performance
  - multilingual-E5-large: For multilingual applications
  Example (illustrative values):
```
INPUT: texts
['cats', 'dogs', ...]
```
```
OUTPUT: embeddings
cats [1,0,0,0,1, ...],
dogs [2,0,0,0,2, ...],
...
```
  Example Implementation:
```
from sentence_transformers import SentenceTransformer

# load embedding model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# generate embeddings
texts = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
embeddings = embedding_model.encode(texts)

print(f'Embedding shape: {embeddings.shape}') # (7, 384)
```
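  Semantic similarity between two embeddings is commonly measured with cosine similarity. The sketch below shows the idea in plain Python; the three-dimensional vectors are made-up values for illustration, not real model output, and `cosine_similarity` is a helper defined here rather than a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_embeddings = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "cars": [0.1, 0.9, 0.3],
}

sim_cats_dogs = cosine_similarity(toy_embeddings["cats"], toy_embeddings["dogs"])
sim_cats_cars = cosine_similarity(toy_embeddings["cats"], toy_embeddings["cars"])

# A higher value means the two vectors point in a more similar direction.
print(f"cats vs dogs: {sim_cats_dogs:.2f}")
print(f"cats vs cars: {sim_cats_cars:.2f}")
```

  With real embeddings the same comparison would be applied to the rows returned by `embedding_model.encode(texts)`.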
- Dimensionality Reduction (optional but recommended):
  To make large volumes of data easier to cluster, we can reduce the dimensionality of the embeddings with a dimensionality reduction library. Some information is lost in the process, so an overly aggressive reduction can make the resulting clusters less accurate.
  
  Example (illustrative values):
```
INPUT (embedding with dimension 5): [1,0,0,0,1]
```
```
OUTPUT (compressed embedding with dimensions 3): [2,0,2]
```
  High-dimensional embeddings can suffer from the "curse of dimensionality" in clustering: as the number of dimensions grows, exponentially more data is needed to capture patterns accurately, which leads to data sparsity, distance concentration (distances between points become nearly equal), and overfitting.
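  Distance concentration can be observed with a small experiment. The sketch below is a simplification: it uses uniformly random points rather than real embeddings, and the `relative_contrast` helper is defined here for illustration. It compares how much the nearest and farthest neighbors of a query point differ in 2 and in 384 dimensions.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

contrast_2d = relative_contrast(dim=2)
contrast_384d = relative_contrast(dim=384)

# In high dimensions the nearest and farthest points are almost equally far away,
# so the contrast shrinks and distance-based clustering has less signal to use.
print(f"relative contrast in 2 dimensions:   {contrast_2d:.2f}")
print(f"relative contrast in 384 dimensions: {contrast_384d:.2f}")
```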
  
  Dimensionality reduction techniques help by:
  - Reducing computational complexity
  - Eliminating noise and redundant features
  - Improving clustering algorithm performance
  - Enabling visualization in 2D/3D space
  UMAP (Uniform Manifold Approximation and Projection) is a popular choice for this step because it:
  - Preserves local structure well while retaining much of the global structure
  - Handles non-linear relationships effectively
  - Maintains cluster separation better
  - Provides more interpretable low-dimensional representations
  UMAP Configuration Guidelines:
```
from umap import UMAP

# conservative reduction for clustering
# (assumes a corpus with many more than 50 documents, not the 7-word toy list)
reducer = UMAP(
    n_components=50, # Moderate reduction
    n_neighbors=15, # Local neighborhood size
    min_dist=0.0, # Tight clusters
    metric='cosine', # Good for text embeddings
    random_state=42 # Reproducibility
)

reduced_embeddings = reducer.fit_transform(embeddings)
```
  Parameter Tuning Tips:
  - n_components: Start with 10-50 for clustering, 2-3 for visualization
  - n_neighbors: Higher values preserve global structure, lower values preserve local structure
  - min_dist: Lower values create tighter clusters
  - metric: Use 'cosine' for text embeddings; 'euclidean' (the default) suits many other kinds of data
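  The tips above can be collected into a small helper that builds the keyword arguments for `UMAP`. This is only a sketch: `suggest_umap_params` is a hypothetical function, not part of umap-learn, and the concrete starting values (10 components for clustering, 2 for visualization) are one reasonable reading of the ranges above.

```python
def suggest_umap_params(n_samples, purpose="clustering"):
    """Build UMAP keyword arguments from the tuning tips (hypothetical helper)."""
    if purpose == "visualization":
        n_components, min_dist = 2, 0.1   # 2D output, UMAP's default spacing
    else:
        n_components, min_dist = 10, 0.0  # low end of the 10-50 range, tight clusters
    # A point cannot have more neighbors than there are other points.
    n_neighbors = max(2, min(15, n_samples - 1))
    return {
        "n_components": n_components,
        "n_neighbors": n_neighbors,
        "min_dist": min_dist,
        "metric": "cosine",
        "random_state": 42,
    }

params = suggest_umap_params(n_samples=1000)
print(params)
```

  The resulting dictionary could then be passed on as `UMAP(**params)`.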
  See this page for more details about UMAP (Uniform Manifold Approximation and Projection) for dimension reduction:
  https://umap-learn.readthedocs.io/en/latest/index.html
- Clustering Algorithm Selection:
  The last step is to use a clustering library to find groups of semantically similar documents.
  
  Example (illustrative values):
```
INPUT (embeddings): [1,0,1], [2,0,2], ...
```
```
OUTPUT:
Cluster 0 ['cats', 'dogs', 'elephants', 'birds'],
Cluster 1 ['cars', 'trains', 'planes']
```
  HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) excels at text clustering because it:
  - Automatically determines the number of clusters
  - Handles clusters of varying densities and shapes
  - Identifies outliers and noise points
  - Provides hierarchical cluster structure
  - Doesn't assume spherical clusters
  HDBSCAN Parameter Tuning:
```
from hdbscan import HDBSCAN

clusterer = HDBSCAN(
    min_cluster_size=5,            # Minimum points per cluster
    min_samples=3,                 # Core point threshold
    metric='euclidean',            # Distance metric
    cluster_selection_method='eom' # Excess of Mass
)
```
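  To give some intuition for `min_samples` and for how outliers are detected, the sketch below computes two quantities that HDBSCAN is built on: the core distance of a point and the mutual reachability distance between two points. It is a simplified plain-Python illustration on made-up 2D points, not the hdbscan library's implementation, which is more involved and may count a point among its own neighbors.

```python
import math

def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[min_samples - 1])
    return result

def mutual_reachability(points, core, i, j):
    """Largest of the two core distances and the direct distance between i and j."""
    return max(core[i], core[j], math.dist(points[i], points[j]))

# Four points in a dense group plus one isolated point (illustrative values).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
core = core_distances(points, min_samples=2)

# Points in the dense group have small core distances; the isolated point
# has a large one, which is why it tends to end up labeled as noise.
for point, distance in zip(points, core):
    print(f"{point}: core distance {distance:.2f}")
print(f"mutual reachability (0,0)-(0,1):   {mutual_reachability(points, core, 0, 1):.2f}")
print(f"mutual reachability (0,0)-(10,10): {mutual_reachability(points, core, 0, 4):.2f}")
```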
  See this page for more details about the HDBSCAN clustering library:
  https://hdbscan.readthedocs.io/en/latest/index.html

Example: word clustering

To simplify, we use one-word sentences in this example.

Install the required modules:

$ pip install sentence-transformers
$ pip install umap-learn
$ pip install hdbscan
$ pip install matplotlib

Python code:

$ vi clustering.py

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from matplotlib import pyplot
import numpy as np

# load embedding model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# generate embeddings
texts = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
embeddings = embedding_model.encode(texts)

print(f'Number of the embedded documents and their dimensions: {embeddings.shape}')

# reduce the embeddings dimensions
reduced_embeddings = UMAP(n_components=5, random_state=42).fit_transform(embeddings)

print(f'Number of the embedded documents and their reduced dimensions: {reduced_embeddings.shape}')

# create an hdbscan object and fit the model to the data
cluster_algorithm = HDBSCAN(min_cluster_size=2).fit(reduced_embeddings)

# get the cluster labels (note: -1 means noise points)
cluster_labels = cluster_algorithm.labels_

# get the number of clusters
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)

print(f'Number of clusters: {n_clusters}')

# print the texts in each cluster (noise points, labeled -1, are skipped)
for cluster in range(n_clusters):
    print(f"Cluster {cluster}:")
    for index in np.where(cluster_labels == cluster)[0][:10]:
        print(f'Feature {index}: {texts[index][:10]}')

# plot the results (only the first two of the five reduced dimensions are shown)
pyplot.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=cluster_labels, cmap='Spectral', s=40)
pyplot.colorbar()
pyplot.savefig('hdbscan_cluster_plot.png')

Run the Python script:

$ python3 clustering.py

Output:

Number of the embedded documents and their dimensions: (7, 384)

Number of the embedded documents and their reduced dimensions: (7, 5)

Number of clusters: 2

Cluster 0:
Feature 0: cats
Feature 1: dogs
Feature 2: elephants
Feature 3: birds

Cluster 1:
Feature 4: cars
Feature 5: trains
Feature 6: planes

Chart of the clusters: hdbscan_cluster_plot.png
[Figure: Clusters Plot]
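
For larger datasets it can be convenient to collect the texts of every cluster, including the noise points labeled -1, into a dictionary. The sketch below uses plain Python; `group_by_cluster` is a helper defined here for illustration, and the label list is a hypothetical example (with one noise point) rather than the output of the run above.

```python
from collections import defaultdict

def group_by_cluster(texts, labels):
    """Map each cluster label to the list of texts assigned to it."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[int(label)].append(text)  # int() also accepts NumPy integer labels
    return dict(groups)

texts = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']
# Hypothetical labels for illustration: -1 marks a noise point.
labels = [0, 0, 0, 0, 1, 1, -1]

groups = group_by_cluster(texts, labels)
for label in sorted(groups):
    name = "Noise" if label == -1 else f"Cluster {label}"
    print(f"{name}: {groups[label]}")
```

In clustering.py the same function could be called with `cluster_labels` in place of the hypothetical list.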