11/25/2023

Topic coherence score

Essentially, BERTopic is a clustering algorithm with a topic representation on top. The assumption here is that good clusters lead to good topic representations. Thus, in order to have a good model, you will need good clusters.

One thing that might be interesting to look at is clustering metrics. NPMI and Topic Diversity especially are frequently used as a proxy for the "quality" of these topic modeling techniques. The metrics that you find in the paper and in OCTIS are, at least in my experience, the most common metrics that you see in academia, so anything you suggest that is not referenced there would be super. Of course, right after writing this I remembered that I hadn't gone back to the paper the OCTIS people wrote: "OCTIS: Comparing and Optimizing Topic Models Is Simple!". That has happened to me more times than I would like to admit!

Topic coherence can be computed with gensim's CoherenceModel by reusing the vectorizer and tokenizer from BERTopic:

```python
from bertopic import BERTopic
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

topic_model = BERTopic(verbose=True, n_gram_range=(1, 3))
topics, probs = topic_model.fit_transform(docs)

# Preprocess documents
cleaned_docs = topic_model._preprocess_text(docs)

# Extract vectorizer and tokenizer from BERTopic
vectorizer = topic_model.vectorizer_model
tokenizer = vectorizer.build_tokenizer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names()
tokens = [tokenizer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[word for word, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics)) - 1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence = coherence_model.get_coherence()
```

I get the coherence value, which in this case was 0.1725 for 'c_v', -0.2662 for 'c_npmi', and -8.5744 for 'u_mass'.

Hello MaartenGr, I tried to execute this, but the problem is the tokenizer. When I consider n_gram_range=(1, 1) like this:

```python
topic_model = BERTopic(verbose=True, embedding_model=embedder,
                       n_gram_range=(1, 1), calculate_probabilities=True)
```

my BERTopic model got topics with n-grams from 1 to 10, while the tokenizer here got tokens with only one term (1-gram).
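Topic Diversity, mentioned above as a common proxy metric, is simple enough to compute by hand: it is the fraction of unique words among the top-k words of all topics. Here is a minimal sketch; the `topic_diversity` helper and the toy word lists are illustrative, not output from a real BERTopic model:

```python
def topic_diversity(topic_words, top_k=10):
    """Fraction of unique words across the top-k words of every topic.

    1.0 means no word is shared between topics; values near 0 mean
    the topics are highly redundant.
    """
    top_words = [words[:top_k] for words in topic_words]
    unique = {word for words in top_words for word in words}
    total = sum(len(words) for words in top_words)
    return len(unique) / total

# Illustrative top words for three toy topics (hypothetical data)
topics = [
    ["cat", "dog", "pet", "fur"],
    ["car", "road", "engine", "fuel"],
    ["dog", "car", "race", "track"],
]
print(topic_diversity(topics, top_k=4))  # 10 unique words / 12 total
```

A score close to 1.0 here would suggest the clusters carve out distinct vocabulary, which ties back to the point that good clusters are what make good topic representations.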