TopicExplorer is designed as middleware that connects machine learning and topic inference with databases and visual, web-based user interfaces. It can be adapted to very different application domains through a novel workflow-based plug-in mechanism. The system stores the training data of the topic model, the inference output, and additional application-dependent data. It is designed to scale to very large data sets. Different data stores, e.g. different types of SQL and NoSQL databases, can be mixed to give optimal performance. We are currently developing an ecosystem of microservices around the TopicExplorer core application. This includes services for data import (such as a blog crawler), corpus configuration, and topic model tuning. The goal is to provide a functionally complete, sustainable, web-based self-service for topic modeling, aimed at non-technical end users such as researchers from the humanities and social sciences.
The source code of TopicExplorer and most of the accompanying services is hosted on GitHub under the GNU Affero General Public License v3.0.
Bug reports, other issue reports, feature requests, and help with programming and documentation are welcome.
Natural Language Processing, Machine Learning and Databases
Document texts are preprocessed using the part-of-speech (POS) tagging software TreeTagger for German and English and MeCab for Japanese. Topic model inference is done with MALLET. All results from POS tagging and topic modeling are stored in a relational database. During preprocessing, a lot of additional information is derived from those results, mainly using SQL.
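To illustrate what deriving additional information with SQL could look like, here is a minimal sketch. It is not TopicExplorer's actual schema or code: the `token` table, its columns, the sample rows, and the helper functions `build_demo_db` and `top_nouns_per_topic` are all hypothetical, and an in-memory SQLite database stands in for whichever relational database is really used. The sketch assumes that each token has already been given a POS tag and a topic assignment, and then counts how often each noun occurs per topic.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical token table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE token (
            document_id INTEGER,  -- document the token belongs to
            position    INTEGER,  -- position of the token in the document
            term        TEXT,     -- token text (e.g. a lemma)
            pos_tag     TEXT,     -- POS tag, as a tagger might produce it
            topic_id    INTEGER   -- topic assigned by topic model inference
        )
        """
    )
    # Made-up sample rows, for illustration only.
    rows = [
        (1, 0, "economy", "NN", 0),
        (1, 1, "grow", "VB", 0),
        (1, 2, "market", "NN", 0),
        (2, 0, "market", "NN", 0),
        (2, 1, "election", "NN", 1),
        (2, 2, "vote", "VB", 1),
    ]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def top_nouns_per_topic(conn):
    """Count noun occurrences per topic, most frequent first within a topic."""
    query = """
        SELECT topic_id, term, COUNT(*) AS n
        FROM token
        WHERE pos_tag = 'NN'
        GROUP BY topic_id, term
        ORDER BY topic_id, n DESC, term
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for topic_id, term, n in top_nouns_per_topic(build_demo_db()):
        print(topic_id, term, n)
```

Keeping the tagger output and the topic assignments in the same table means that derived statistics like this one need only a single aggregate query, with no processing outside the database.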