Publications
Old but Not Obsolete: Bag-of-Words vs. Embeddings in Topic Modeling.
Description
Topic modeling techniques, from classical Bag-of-Words (BOW)-based methods like Latent Dirichlet Allocation (LDA) to emerging embedding-based models such as Top2Vec and BERTopic, are pivotal for uncovering latent themes in text corpora. This study builds on previous work on an alternative BOW-based approach relying on feature maximization, CFMf, addressing its limitations and extending comparisons across multiple metrics. Using a corpus of philosophy of science research articles (N=16,917), we evaluate LDA, CFMf, Top2Vec, and BERTopic on coherence, diversity, and recall, while also qualitatively examining top-word interpretability. Results reveal distinct trade-offs: Top2Vec excels in coherence and diversity but underperforms in recall and interpretability; BERTopic marginally outperforms LDA in coherence but not in recall; CFMf balances these dimensions, performing competitively in coherence and diversity. These findings highlight the enduring relevance of BOW-based models and underscore the modularity of topic modeling pipelines, advocating for hybrid approaches that combine the best-performing components.
Reference
Lamirel, J.-C., Lareau, F., & Malaterre, C. (2025). Old but Not Obsolete: Bag-of-Words vs. Embeddings in Topic Modeling. In S. Sargsyn, W. Glänzel, & G. Abramo (Eds.), Proceedings of the 20th International Conference on Scientometrics & Informetrics (Vol. 2, pp. 2275–2282). Yerevan: ISSI.