Publications
CFMf topic-model: comparison with LDA and Top2Vec
Description
Mining the content of scientific publications is increasingly used to investigate the practice of science and the evolution of research domains. Topic models, notably LDA (a statistical bag-of-words approach) and Top2Vec (an embedding-based approach), have been shown to provide rich insights into the thematic content of disciplinary fields, their structure, and their evolution through time. However, improving topic-modeling methods remains a major concern. Here we propose an alternative topic-modeling approach based on neural clustering and feature maximization with F1-measure (in short: CFMf). We compare the performance of this approach to LDA and Top2Vec by applying all three methods to a reference corpus of full-text philosophy of science articles (N = 16,917). The results reveal significant improvements in coherence measures with CFMf, independently of the number of topics. Qualitative comparisons show overall consistency in topical coverage across the three methods, with some differences: in particular, CFMf appears to be affected by the presence of a large class, while Top2Vec generates some sets of top words that are very difficult to interpret. We discuss these results and highlight upcoming research work.
Reference
Lamirel, J.-C., Lareau, F. and Malaterre, C. (2024). CFMf topic-model: comparison with LDA and Top2Vec. Scientometrics, 1-19.