Topic modeling is about finding themes (topics) within clusters of textual documents.
The topics are labels (keywords) that capture the meaning of each cluster.
BERTopic is a modular topic modeling technique that extracts topic representations from documents.
BERTopic extracts topics in two steps: text clustering and topic representation.
The text clustering step groups semantically similar documents into clusters; the topic representation step then derives the keywords that describe each cluster.
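The two steps can be tried with a minimal sketch like the one below. It assumes the third-party bertopic package is installed (pip install bertopic) and relies on BERTopic's default components; the helper name fit_topics is only an illustration and is not part of the library.

```python
def fit_topics(docs):
    """Fit BERTopic on a list of strings; return the model and one topic ID per document."""
    # Imported here so the sketch can be loaded without bertopic installed.
    # Assumption: `pip install bertopic` has been run.
    from bertopic import BERTopic

    # Default settings: the documents are embedded and clustered (step 1),
    # then keywords are extracted for every cluster (step 2).
    topic_model = BERTopic()
    topics, _probabilities = topic_model.fit_transform(docs)
    return topic_model, topics
```

Calling fit_topics(docs) on a reasonably large list of documents (a few hundred or more tends to work better than a handful) returns the fitted model. Its get_topic_info() method then gives a table of the topics, and visualize_barchart() draws a chart of the top keywords per topic.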
See this page for more details about BERTopic:
https://maartengr.github.io/BERTopic/index.html
Example: Creating topics from a set of words
Required modules:
Python code:
Output:
Chart of the topics:

Each topic is named after its main keywords, which are concatenated with the underscore character (“_”) and prefixed with the topic's number.
A special topic with the number -1 may also be listed.
It collects the outliers, that is, the documents that do not match any of the topics found.
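To make the naming scheme and the -1 topic concrete, here is a small sketch in plain Python. It does not call BERTopic: the keyword lists and topic assignments are made-up illustration data, and topic_name is a hypothetical helper that assumes a name is the topic number followed by its keywords.

```python
# Made-up keyword lists per topic number; -1 is the outlier topic.
topic_keywords = {
    -1: ["the", "and", "for", "with"],
    0: ["game", "team", "season", "players"],
    1: ["space", "nasa", "launch", "orbit"],
}

# Made-up topic number assigned to each of five documents.
assignments = [0, 1, -1, 0, -1]


def topic_name(topic_id, keywords):
    """Join the topic number and its keywords with underscores."""
    return "_".join([str(topic_id), *keywords])


names = {tid: topic_name(tid, kws) for tid, kws in topic_keywords.items()}

# Positions of the documents that ended up in the outlier topic.
outlier_docs = [i for i, tid in enumerate(assignments) if tid == -1]

print(names[0])      # 0_game_team_season_players
print(names[-1])     # -1_the_and_for_with
print(outlier_docs)  # [2, 4]
```

Filtering on -1 in this way is a simple means of setting the outliers aside before inspecting the remaining topics.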
BERTopic can also use a generative language model to produce readable labels for the topics.
For that, we need to craft a prompt with two parts:
- A subset of the documents that best represent the topic, which will be inserted using the [DOCUMENTS] tag.
- The keywords that make up the topic of the cluster, which will be inserted using the [KEYWORDS] tag.
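The sketch below shows what such a prompt might look like and how the two tags could be filled in. BERTopic performs the substitution itself; the fill_prompt helper, the template wording, and the sample documents and keywords are all hypothetical and serve only to show the idea.

```python
# Example template containing the two tags; the wording is illustrative.
PROMPT_TEMPLATE = (
    "I have a topic that contains the following documents:\n"
    "[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
    "Based on the documents and keywords, give a short label for this topic."
)


def fill_prompt(template, documents, keywords):
    """Hypothetical helper: replace the [DOCUMENTS] and [KEYWORDS] tags."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    return (
        template
        .replace("[DOCUMENTS]", doc_block)
        .replace("[KEYWORDS]", ", ".join(keywords))
    )


prompt = fill_prompt(
    PROMPT_TEMPLATE,
    documents=["The rocket reached orbit on schedule.", "NASA delayed the launch."],
    keywords=["space", "nasa", "launch", "orbit"],
)
print(prompt)
```

In BERTopic, the template itself (with the tags left in place) is what gets handed to a representation model, typically through a prompt argument; the exact class and parameters depend on the language model used, so check the BERTopic documentation.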
Example: Creating labeled topics from a set of words
Python code:
Output: