# Using BERTopic at Hugging Face

BERTopic is a topic modeling framework that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
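The core idea of c-TF-IDF is to treat all documents in a topic as a single class and weight terms per class rather than per document, using the weight tf<sub>t,c</sub> · log(1 + A / f<sub>t</sub>), where A is the average number of words per class and f<sub>t</sub> the term's total frequency. Below is a minimal sketch of that idea, not BERTopic's actual implementation; the pre-tokenized classes and the `c_tf_idf` helper are illustrative:

```python
import math
from collections import Counter

def c_tf_idf(class_tokens):
    """Compute simplified c-TF-IDF weights per class:
    tf_{t,c} * log(1 + A / f_t), where A is the average
    number of words per class and f_t the term's total frequency."""
    counts = [Counter(tokens) for tokens in class_tokens]
    total = Counter()
    for c in counts:
        total.update(c)
    avg_words = sum(len(tokens) for tokens in class_tokens) / len(class_tokens)
    return [
        {term: tf * math.log(1 + avg_words / total[term]) for term, tf in c.items()}
        for c in counts
    ]

# Toy example: two "topics", each a bag of tokens from its concatenated documents
classes = [["film", "movie", "film"], ["sport", "game", "game", "film"]]
weights = c_tf_idf(classes)
```

Terms that are frequent within a class but rare across all classes receive high weights, which is what makes the resulting topic descriptions interpretable.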

BERTopic supports all kinds of topic modeling techniques:

| | | |
|:--|:--|:--|
| Guided | Supervised | Semi-supervised |
| Manual | Multi-topic distributions | Hierarchical |
| Class-based | Dynamic | Online/Incremental |
| Multimodal | Multi-aspect | Text Generation/LLM |
| Zero-shot (new!) | Merge Models (new!) | Seed Words (new!) |

## Exploring BERTopic on the Hub

You can find BERTopic models by filtering at the left of the models page.

BERTopic models hosted on the Hub have a model card with useful information about the model. Thanks to BERTopic's Hugging Face Hub integration, you can load these models with a few lines of code. You can also deploy them using Inference Endpoints.

## Installation

To get started, you can follow the BERTopic installation guide. You can also use the following one-line install through pip:

```bash
pip install bertopic
```

## Using Existing Models

All BERTopic models can easily be loaded from the Hub:

```python
from bertopic import BERTopic

topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
```

Once loaded, you can use BERTopic's features to predict the topics for new instances:

```python
topic, prob = topic_model.transform("This is an incredible movie!")
topic_model.topic_labels_[topic[0]]
```

Which gives us the following topic:

```
64_rating_rated_cinematography_film
```
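Default BERTopic labels follow the pattern `<topic_id>_<word1>_<word2>_...`, i.e. the topic's numeric ID followed by its top words. A small helper (hypothetical, not part of BERTopic's API) can split such a label apart:

```python
def parse_topic_label(label: str):
    """Split a default BERTopic label into (topic_id, top_words)."""
    topic_id, *words = label.split("_")
    return int(topic_id), words

print(parse_topic_label("64_rating_rated_cinematography_film"))
# (64, ['rating', 'rated', 'cinematography', 'film'])
```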

## Sharing Models

When you have created a BERTopic model, you can easily share it with others through the Hugging Face Hub. To do so, use the `push_to_hf_hub` function, which pushes the model directly to the Hub:

```python
from bertopic import BERTopic

# Train model
topic_model = BERTopic().fit(my_docs)

# Push to the Hugging Face Hub
topic_model.push_to_hf_hub(
    repo_id="MaartenGr/BERTopic_ArXiv",
    save_ctfidf=True
)
```

Note that the saved model does not include the dimensionality reduction and clustering algorithms. Those are removed since they are only necessary to train the model and find relevant topics. Inference is done through a straightforward cosine similarity between the topic and document embeddings. This not only speeds up the model but allows us to have a tiny BERTopic model that we can work with.
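The cosine-similarity inference step described above can be sketched with plain NumPy. The embeddings here are toy values; in a real model, the document embedding comes from the underlying embedding model and the topic embeddings are stored with the saved BERTopic model:

```python
import numpy as np

def assign_topics(doc_embeddings, topic_embeddings):
    """Assign each document to the topic whose embedding has the
    highest cosine similarity with the document embedding."""
    # Normalize rows so that the dot product equals cosine similarity
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    sims = docs @ topics.T  # shape: (n_docs, n_topics)
    return sims.argmax(axis=1)

# Toy example: 2 documents, 3 topics, 4-dimensional embeddings
doc_emb = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 1.0, 0.0]])
topic_emb = np.array([[0.9, 0.1, 0.0, 0.0],
                      [0.0, 1.0, 0.9, 0.1],
                      [0.0, 0.0, 0.0, 1.0]])
print(assign_topics(doc_emb, topic_emb))  # [0 1]
```

Because this step needs only a matrix multiplication over normalized embeddings, dropping the dimensionality reduction and clustering components keeps inference fast and the saved model small.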

## Additional Resources
