Advanced Topic Detection with Deep Learning

Dr. Gabriel Lopez
6 min read · Apr 10, 2023


Use BERT, UMAP and HDBSCAN to capture document topics, closely following the BERTopic architecture (a transformer-encoder-based pipeline).

Topic detection is an NLP task aimed at extracting global "topics" from a corpus of text documents. For example, if we are looking at a dataset of book descriptions, topic detection allows us to classify books into categories like "romance", "sci-fi", "travel", etc.

In this tutorial we will use the HuggingFace library implementation of BERT together with HDBSCAN for clustering and UMAP for dimensionality reduction. The pipeline will follow the BERTopic structure proposed by Maarten Grootendorst:

BERTopic pipeline

Let's get started!

For the sake of simplicity I recommend running the code in Google Colab, but any other platform will work as well.

Start by installing the necessary dependencies:

!pip install pandas numpy umap-learn transformers plotly hdbscan

Then proceed to load the input data with:

import pandas as pd
data = pd.read_csv("ecommerce.csv", on_bad_lines='skip', nrows=500)
data = data[[""]]

In our example the data comes from an e-commerce shop dataset taken from Kaggle. I've sampled the data down to 500 rows to make the code run faster, but you can use the full dataset if needed. The data looks like this:

input dataset

As we can see, the column "text" contains the article descriptions. Our topic modelling goal is to find the correct article category for each article description. For instance, the description "Joyo Multi-Utility Compact Foldable Table" can be labelled as a "household" item.

We will call the entire dataset a corpus, each text row a document and each (sub-)word a token.

BERT to find document encodings

To be able to properly cluster articles into their right department, we need to vectorise and embed the text descriptions in a latent space such that elements belonging to the same department are geometrically closer to each other. We will use BERT as the data encoder.

Start by loading the BERT model. HuggingFace allows us to download a pre-trained model from a given model instance (a.k.a. "checkpoint"). That way we don't have to train the full BERT ourselves!

# load BERT model (for embeddings)
from transformers import BertTokenizer, TFBertModel
checkpoint = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = TFBertModel.from_pretrained(checkpoint)
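
As a quick optional check (just an illustrative sketch), we can verify what this checkpoint gives us to work with:

# optional: inspect the pre-trained checkpoint
print(tokenizer.vocab_size)      # 30522 WordPiece tokens in the vocabulary
print(model.config.hidden_size)  # 768, the size of the embeddings we will see later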

Note that we load the tokenizer together with the model. The tokenizer does 3 main things:

  1. Break each document into a set of words/subwords (a.k.a. tokens) such that they can be digested by the model.
    For example: ”Hello World!” → [‘hello’, ‘world’, ‘!’]
  2. Map each word token to a numeric index id.
    For example: [‘hello’, ‘world’, ‘!’] → [7592, 2088, 999]
    Note that ids are not embeddings, but rather just numeric identifiers used by the model to map tokens in your document to tokens in its vocabulary. Thanks to its pre-training, the model knows the relationships between the tokens in its own vocabulary.
  3. Add special tokens such that the model knows where sentences start/end, where separations are, etc. For example: [CLS], [SEP], …
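
To make these three steps concrete, here is a small illustrative check using the tokenizer loaded above (the values in the comments are what bert-base-uncased typically returns):

# illustrate the three tokenizer steps on a toy sentence
tokens = tokenizer.tokenize("Hello World!")
print(tokens)  # ['hello', 'world', '!']
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)  # [7592, 2088, 999]
print(tokenizer("Hello World!")["input_ids"])  # [101, 7592, 2088, 999, 102] = [CLS] ... [SEP]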

Tokenization is executed as follows:

# Tokenize corpus with the BERT tokenizer (WordPiece algorithm)
descr_processed_tokenized = tokenizer(
    list(data["text"]),
    return_tensors="tf",
    truncation=True,
    padding=True,
    max_length=128,
)
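
If you want to peek at the result (optional), the tokenizer returns a batch of token ids plus an attention mask, all padded/truncated to a common length of at most 128:

# optional: inspect the tokenized batch
print(descr_processed_tokenized.keys())              # input_ids, token_type_ids, attention_mask
print(descr_processed_tokenized["input_ids"].shape)  # (500, sequence_length <= 128)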

After tokenization, our text data (now represented numerically as a set of token ids) can be ingested by the BERT model:

# Encode corpus using BERT
output_bert = model(descr_processed_tokenized)

This will produce embeddings for the entire input data. When applied to a single sentence, the model output will look like this:

output of BERT for a single text example

As we can see, BERT has encoded each input token with a 768-dimensional embedding vector.
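
You can verify these dimensions yourself; a minimal sketch (the exact sequence length depends on the tokenizer output):

# contextual embeddings live in last_hidden_state: (documents, tokens, 768)
print(output_bert.last_hidden_state.shape)  # (500, sequence_length, 768)

# the same applies to a single sentence
single = tokenizer("Joyo Multi-Utility Compact Foldable Table", return_tensors="tf")
print(model(single).last_hidden_state.shape)  # (1, num_tokens, 768)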

However, we are interested in the embedding of the entire article description rather than just a list of word embeddings. To extract an embedding for the whole document, we simply average over all word embeddings in the document.

# Get sentence embeddings from BERT's word embeddings (mean pooling)
import numpy as np

mean_vect = []
for vect in output_bert.last_hidden_state:
    mean_vect.append(np.mean(vect, axis=0))
data = data.assign(descr_vect=mean_vect)

Now, finally, we have assigned a vector to each element in our input table. This is known as a document embedding.
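
As a side note, the same mean pooling can be written as a single vectorised call; an equivalent sketch (like the loop above, it averages over padding tokens as well):

# equivalent vectorised mean pooling: (500, tokens, 768) -> (500, 768)
doc_embeddings = np.mean(output_bert.last_hidden_state.numpy(), axis=1)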

Document Embedding with BERT

UMAP to project the vector embeddings to a lower dimension

Applying a clustering algorithm directly over a 768-dimensional space is problematic due to the curse of dimensionality. Therefore it is necessary to apply a dimensionality reduction technique first.

One very popular technique is known as UMAP (Uniform Manifold Approximation and Projection). This technique is known for respecting both the global and the local structure of the data after the reduction.

Mathematical Note: UMAP is based on the assumption that data classes can be represented as manifolds (manifold hypothesis) and the assumption that those manifolds are differentiable, such that UMAP's fuzzy logic is applicable.

We will reduce the dimension of the embedding vectors to 3, so that we can visualise the result. Let's do it as follows:

# Use UMAP to lower the dimensionality of the embedding to 3D
import umap

descr_vect_3d = umap.UMAP(n_components=3).fit_transform(
    np.stack(data["descr_vect"].values)
)
data["descr_vect_3d"] = list(descr_vect_3d)

np.stack maps a nested array (array(array(…))) into a single 2D array. Our data now looks like this:

data after reducing the dimension of the embeddings from 768d to 3d
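
If you want more control over the local/global trade-off mentioned above, the two most relevant UMAP parameters are n_neighbors and min_dist: larger n_neighbors favours global structure, smaller values favour local structure. An optional sketch with the library defaults written out explicitly (random_state added only for reproducibility):

# UMAP with its default parameters made explicit (tune to taste)
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=3, random_state=42)
descr_vect_3d = reducer.fit_transform(np.stack(data["descr_vect"].values))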

Mathematical Note: It is theoretically incorrect to use UMAP for downstream processing tasks like density-based clustering. It is well known that the embeddings produced by UMAP might not be density preserving and might introduce density discontinuities in the latent manifolds. Nevertheless, in practice this is often not an issue. Learn more here.

Apply Clustering to Capture Topics

At this point we have successfully mapped documents to vectors in a 3D latent space. Now it’s time to apply a clustering algorithm to find relevant topics.

We use HDBSCAN for clustering, as it makes few assumptions about the structure of the data while producing top-notch results even with little-to-no parameter tweaking.

# Cluster the BERT + UMAP vector embeddings using HDBSCAN
import hdbscan
clustering = hdbscan.HDBSCAN().fit(np.stack(data["descr_vect_3d"].values))
data["cluster_label"] = clustering.labels_

Plotting data[“descr_vect_3d”].values as a 3D scatter plot we get the following:

Topics detected (color) over a 3d document-embedding space

Topics here are represented by colors. The algorithm labels them simply as {0, 1, 2}.
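
The plot above can be reproduced with plotly (which we installed at the start); a minimal sketch:

# 3D scatter of the document embeddings, coloured by cluster label
import plotly.express as px

coords = np.stack(data["descr_vect_3d"].values)
fig = px.scatter_3d(
    x=coords[:, 0], y=coords[:, 1], z=coords[:, 2],
    color=data["cluster_label"].astype(str),
)
fig.show()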

Mathematical Note: HDBSCAN is an extension of DBSCAN that uses graph pruning to reveal the optimal scale for the underlying clusters in the data, thus resolving the intrinsic non-uniqueness of the cluster hierarchy. Consequently, the number of clusters is found automatically by the algorithm.

Finding Topics

After a simple inspection of the cluster labels and the descriptions behind them (see the sketch below) we recognise some obvious topic categories:
topic = 0 → Books
topic = 1 → Household
topic = 2 → Clothing & Accessories
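
A quick way to do this inspection yourself (an illustrative sketch; the exact descriptions you see depend on your sample):

# peek at a few descriptions per cluster to name the topics
for label, group in data.groupby("cluster_label"):
    print(f"cluster {label}:")
    print(group["text"].head(3).to_list(), "\n")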

To finalise, let's annotate our data with the corresponding article categories (i.e. document topics).

data["article category"] = data["cluster_label"].map(
{0: 'Books', 1: 'Household', 2: 'Clothing & Accessories'})

Then our final data looks like:

Data annotated with desired article categories

This is our desired result. Thanks for reading and keep learning!

Want to learn more?

Check my other socials:
LinkedIn, HuggingFace, GitHub, Medium, YouTube


Written by Dr. Gabriel Lopez

Senior Data Scientist and PhD (Delft University). Science and technology enthusiast. Passionate about food and fractals. Dog lover.
