Topic Modeling with LLMs: Examples & Label Generation

Topic Modeling
Example: Basic Topic Modeling
Example: Generating Topic Labels

Topic Modeling
Topic modeling is an unsupervised machine learning technique that automatically discovers abstract topics within a collection of documents. It identifies patterns in word usage and groups documents that share similar themes, providing insights into the underlying structure of large text corpora.

Key Benefits:
- Automatically organize large document collections.
- Discover hidden themes and patterns in text data.
- Reduce dimensionality of text data for analysis.
- Enable content recommendation and search improvements.
- Support exploratory data analysis of textual content.
Example (for simplicity, each document is a single word):
```
Cluster 0: ['cats', 'dogs', 'elephants', 'birds'] ==> topic: animals
Cluster 1: ['cars', 'trains', 'planes'] ==> topic: car
```
BERTopic is a topic modeling technique that leverages transformer-based embeddings to create more semantically meaningful topics.
In BERTopic, document clusters are formed based on semantic similarity and then interpreted as topics.
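
To make "semantic similarity" concrete, here is a minimal sketch of cosine similarity, a measure commonly used to compare embedding vectors. The three-dimensional vectors are hypothetical values made up for illustration; a real transformer model produces vectors with hundreds of dimensions, and this sketch does not claim to show how BERTopic's components measure distance internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (invented values, not real model output)
toy_embeddings = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "cars": [0.1, 0.9, 0.2],
}

sim_cats_dogs = cosine_similarity(toy_embeddings["cats"], toy_embeddings["dogs"])
sim_cats_cars = cosine_similarity(toy_embeddings["cats"], toy_embeddings["cars"])

# The two animal vectors point in nearly the same direction, so they score higher
print(f"cats vs dogs: {sim_cats_dogs:.3f}")
print(f"cats vs cars: {sim_cats_cars:.3f}")
```

Documents whose vectors score high against each other are the ones a clustering step would tend to place in the same group.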

The topic modeling steps in BERTopic:
- Document Embeddings: Convert documents into high-dimensional vector representations using transformer models.
- Dimensionality Reduction: Use UMAP to reduce embedding dimensions while preserving local structure.
- Clustering: Apply HDBSCAN to group similar documents into clusters.
- Topic Representation: Extract representative keywords for each cluster using TF-IDF or other representation models.
BERTopic characteristics:
- Semantic Understanding: Uses contextual embeddings that capture word meaning better than bag-of-words approaches.
- Hierarchical Structure: Supports topic hierarchies and subtopics.
- Flexibility: Modular design allows customization of each component.
- Visualization: Rich visualization capabilities for topic exploration.
See this page for more details about BERTopic:
https://maartengr.github.io/BERTopic/index.html

Example: Basic Topic Modeling

Let's start with a basic example using individual words to understand the fundamental concepts.

Install the required modules:

$ pip install bertopic

Python code:

$ vi topic-modeling.py

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# sample data - in practice, you'd use full sentences or documents
sentences = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']

# initialize the sentence transformer model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# generate embeddings
embeddings = embedding_model.encode(sentences)

# configure BERTopic with custom parameters + fit the model
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=UMAP(n_components=5, random_state=42),
    hdbscan_model=HDBSCAN(min_cluster_size=2),
    verbose=True
).fit(sentences, embeddings)

# display results
print("Topics info:")
print(topic_model.get_topic_info())

print("Topic 0 info:")
print(topic_model.get_topic(0))

print("Topic 1 info:")
print(topic_model.get_topic(1))

# create and save visualizations
fig = topic_model.visualize_barchart()
fig.write_html("bertopic-barchart-figure.html")

Run the Python script:

$ python3 topic-modeling.py

Output:

Topics info:
       Topic  Count    Name                         Representation                                 Representative_Docs
0      0      4        0_cats_birds_elephants_dogs  [cats, birds, elephants, dogs, , , , , , ]     [birds, cats, dogs]
1      1      3        1_cars_trains_planes_        [cars, trains, planes, , , , , , , ]           [planes, cars, trains]

Topic 0 info:
[
    ('cats', np.float64(0.34657359027997264)),
    ('birds', np.float64(0.34657359027997264)),
    ('elephants', np.float64(0.34657359027997264)),
    ('dogs', np.float64(0.34657359027997264)),
    ('', 1e-05),
    ('', 1e-05),
    ('', 1e-05),
    ('', 1e-05),
    ('', 1e-05),
    ('', 1e-05)
]

Topic 1 info:
[
    ('cars', np.float64(0.46209812037329684)),
    ('trains', np.float64(0.46209812037329684)),
    ('planes', np.float64(0.46209812037329684)),
    ('', 1e-05),
    ('', 1e-05),
    ('', 1e-05),
    ('', 1e-05),
    ('', 1e-05),
    ('', 1e-05),
    ('', 1e-05)
]

Topics are represented by the main keywords extracted from the text. Each topic name is formed by joining the topic number and these keywords with underscores ("_"). Because each cluster here contains only a few distinct words, the remaining keyword slots appear as empty strings. A special topic labeled "-1" may also appear; it collects the outliers, that is, documents that could not be assigned to any of the identified topics.
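
The keyword scores in the output above can be reproduced by hand. The sketch below assumes a class-based TF-IDF weighting of the form tf * log(1 + A / f), where tf is a word's relative frequency inside a cluster, f is its frequency across all clusters, and A is the average number of words per cluster truncated to an integer. Both the formula and the truncation are assumptions, chosen because they match the printed scores; class_tfidf is a hypothetical helper and not BERTopic's own code.

```python
import math
from collections import Counter

# Documents grouped by cluster, as in the example output
clusters = {
    0: ["cats", "dogs", "elephants", "birds"],
    1: ["cars", "trains", "planes"],
}


def class_tfidf(clusters):
    """Score each word per cluster with tf * log(1 + A / f) (assumed formula)."""
    # Treat every cluster as one large document and count its words
    counts = {
        cluster_id: Counter(word for doc in docs for word in doc.split())
        for cluster_id, docs in clusters.items()
    }
    # f: frequency of each word across all clusters
    overall = Counter()
    for counter in counts.values():
        overall.update(counter)
    # A: average number of words per cluster, truncated to an integer (assumption)
    avg_words = int(sum(sum(c.values()) for c in counts.values()) / len(counts))

    scores = {}
    for cluster_id, counter in counts.items():
        total = sum(counter.values())
        scores[cluster_id] = {
            word: (count / total) * math.log(1 + avg_words / overall[word])
            for word, count in counter.items()
        }
    return scores


scores = class_tfidf(clusters)
print(round(scores[0]["cats"], 4))  # 0.3466
print(round(scores[1]["cars"], 4))  # 0.4621
```

Every word in a cluster gets the same score here because each word occurs exactly once; on real documents, frequent cluster-specific words would rank higher.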

[Figure: Topics Plot, the Topic Word Scores bar chart saved as bertopic-barchart-figure.html]

Example: Generating Topic Labels

One of BERTopic's powerful features is the ability to generate human-readable topic labels using language models.

In our example, we will create a prompt that has two parts:
- A subset of the documents that best represent the topic, inserted at the [DOCUMENTS] tag.
- The keywords that make up the topic cluster, inserted at the [KEYWORDS] tag.

INPUT (subset of documents + data):
    list of documents:
    [DOCUMENTS]
    list of keywords:
    [KEYWORDS]
    predict the label of the topic.

OUTPUT:
    <labeled topics>
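
To illustrate how the two tags work, the following sketch fills a template by plain string replacement. It is an illustration only: fill_prompt is a hypothetical helper, and the exact formatting (one document per line, comma-separated keywords) is an assumption rather than a description of what BERTopic does internally.

```python
# Template mirroring the INPUT layout described in the text
template = """list of documents:
[DOCUMENTS]
list of keywords:
[KEYWORDS]
predict the label of the topic."""


def fill_prompt(template, documents, keywords):
    """Substitute the [DOCUMENTS] and [KEYWORDS] tags (hypothetical helper)."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)  # assumed formatting
    keyword_block = ", ".join(keywords)                     # assumed formatting
    return template.replace("[DOCUMENTS]", doc_block).replace("[KEYWORDS]", keyword_block)


filled = fill_prompt(
    template,
    documents=["birds", "cats", "dogs"],
    keywords=["cats", "birds", "elephants", "dogs"],
)
print(filled)
```

The filled prompt is what a text generation model would receive once per topic; its answer then serves as the topic label.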

Python code:

$ vi label-topic-modeling.py

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from transformers import pipeline
from bertopic.representation import TextGeneration

sentences = ['cats', 'dogs', 'elephants', 'birds', 'cars', 'trains', 'planes']

# initialize the sentence transformer model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

# create embeddings
embeddings = embedding_model.encode(sentences)

# configure BERTopic with custom parameters + fit the model
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=UMAP(n_components=5, random_state=42),
    hdbscan_model=HDBSCAN(min_cluster_size=2),
    verbose=True
).fit(sentences, embeddings)

# prompt for topic labeling
prompt = """These documents belong to the same topic:
[DOCUMENTS]

These keywords give details about the topic: '[KEYWORDS]'.

Given these documents and keywords, what is this topic about?"""

# initialize text generation pipeline
# use a model ("google/flan-t5-small") to label the topics
generator = pipeline("text2text-generation", model="google/flan-t5-small")

# create representation model
representation_model = TextGeneration(
    generator,
    prompt=prompt,
    doc_length=50,
    tokenizer="whitespace"
)

# update topics with the generated labels
topic_model.update_topics(sentences, representation_model=representation_model)

# print the topic labels
print(topic_model.get_topic_info())
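
The doc_length=50 and tokenizer="whitespace" arguments in the script limit how much of each document is placed into the prompt. A minimal sketch of what whitespace-based truncation looks like is shown below; truncate_whitespace is a hypothetical helper written for illustration, and the assumption is that tokens are simply the pieces obtained by splitting on whitespace.

```python
def truncate_whitespace(document, doc_length):
    """Keep only the first doc_length whitespace-separated tokens."""
    tokens = document.split()
    return " ".join(tokens[:doc_length])


long_document = "the quick brown fox jumps over the lazy dog"
short_version = truncate_whitespace(long_document, 4)
print(short_version)  # the quick brown fox

# One-word documents, as in this example, are far below the limit and stay unchanged
print(truncate_whitespace("cats", 50))  # cats
```

Truncation of this kind keeps prompts short, which matters for small models such as the one used here.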

Run the Python script:

$ python3 label-topic-modeling.py

Output:

       Topic  Count          Name           Representation                 Representative_Docs
0      0      4              0_animals___  [animals, , , , , , , , , ]     [birds, cats, dogs]
1      1      3              1_car___      [car, , , , , , , , , ]         [planes, cars, trains]
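
The Name column in the outputs shown on this page looks like the topic number followed by the first four entries of Representation, joined with underscores, which would explain the trailing underscores when entries are empty. The sketch below reproduces that pattern; it is inferred from the printed output only, and topic_name is a hypothetical helper, not part of BERTopic's API.

```python
def topic_name(topic_id, keywords, top_n=4):
    """Join the topic number and the first top_n keywords with underscores."""
    return f"{topic_id}_" + "_".join(keywords[:top_n])


# Representation lists copied from the printed outputs (padded with empty strings)
labeled = ["animals"] + [""] * 9
unlabeled = ["cars", "trains", "planes"] + [""] * 7

print(topic_name(0, labeled))    # 0_animals___
print(topic_name(1, unlabeled))  # 1_cars_trains_planes_
```

If cleaner names are needed for reports, the first entry of Representation can be used on its own as the label.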