Situation: I have a NumPy term-document matrix, for example [[0,1,0,0,...], ..., [...,0,0,0,0]]. I have plugged this matrix into gensim's LdaModel, and it works fine with lda = LdaModel(corpus, num_topics=10), where corpus is the term-document matrix mentioned above. I need two intermediate matrices for research purposes: 1) the per-document topic probability matrix (p_d_t) and 2) the per-topic word probability matrix (p_w_t). Question: How can I get those arrays from the gensim LdaMod…

Do you have any suggestions for how I could subdivide documents into sentences before training MALLET LDA? Thank you in advance…
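One possible preprocessing sketch: split each document into sentences and emit one sentence per line in MALLET's `import-file` format (`<name> <label> <text>`), so every sentence becomes its own instance. The naive regex splitter and the document data are illustrative assumptions; a real pipeline might use nltk.sent_tokenize or spaCy instead:

```python
import re

def split_sentences(text):
    # Split on ., !, ? followed by whitespace -- crude but dependency-free.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Hypothetical input documents keyed by id.
docs = {"doc1": "Topic models are useful. They find themes. Try MALLET!"}

lines = []
for doc_id, text in docs.items():
    for i, sent in enumerate(split_sentences(text)):
        # One MALLET instance per sentence: <name> <label> <text>.
        lines.append(f"{doc_id}-{i}\tX\t{sent}")

print("\n".join(lines))
```

The resulting file can then be fed to `mallet import-file --input sentences.txt --keep-sequence`.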

I have read about LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents. References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. Thus, by using the LDA algorithm and a Gibbs sampler (or variational Bayes), I can input a set of documents and get the topics as output. Each topic is a set of terms with assigned probabilities. What I don't understand is: if the above…

I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post http://goo.gl/ccPvE I was able to develop an intuition for LDA. However, I still don't have a complete understanding of the various calculations that go into it. I am wondering whether someone can show me the calculations using a very small corpus (say, 3-5 sentences and 2-3 topics)…
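To make the calculations concrete, here is a teaching sketch of collapsed Gibbs sampling for LDA on a toy corpus of three tiny "documents" and two topics. The corpus, hyperparameters, and iteration count are all made up; this is not a production sampler:

```python
import random

random.seed(0)

docs = [["apple", "banana", "apple"],
        ["soccer", "tennis", "soccer"],
        ["apple", "tennis"]]
K, alpha, beta = 2, 0.1, 0.01
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Random initial topic for every token, plus the three count tables.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]          # document-topic counts
nkw = [[0] * V for _ in range(K)]      # topic-word counts
nk = [0] * K                           # tokens per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d][k] += 1; nkw[k][vocab.index(w)] += 1; nk[k] += 1

for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k, v = z[d][i], vocab.index(w)
            # Remove this token's current assignment from the counts...
            ndk[d][k] -= 1; nkw[k][v] -= 1; nk[k] -= 1
            # ...then resample its topic proportional to
            # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta).
            weights = [(ndk[d][j] + alpha) * (nkw[j][v] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights=weights)[0]
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][v] += 1; nk[k] += 1

# Posterior-mean estimate of p(word | topic) from the final counts.
phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)] for k in range(K)]
for k in range(K):
    print(f"topic {k}:", {w: round(phi[k][vocab.index(w)], 2) for w in vocab})
```

With a corpus this small the topics found are noisy, but the mechanics (decrement counts, resample, increment counts) are exactly what larger samplers do.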

I am using MALLET from a Scala project. After training the topic model and getting the inferencer file, I tried to assign topics to new texts. The problem is that I get different results with different calling methods. Here are the things I tried: creating a new InstanceList, ingesting just one document, and reading the topic results from the InstanceList:

somecontentList.map(text => getTopics(text, model))

def getTopics(text: String, inferencer: TopicInferencer): Array[Double] = {
  val testing = new InstanceList(pipe)
  testing.addThruPipe(new Instance(text, n…

I have a question when using topic models like pLSA/LDA: how do I infer the topic distribution of a new document once we have the distribution over words for each topic? I have tried "fold-in" Gibbs sampling with LDA, but when the unseen document is very short this method doesn't work, because of the random assignment of topics to the words in the document. For example, consider a model with two topics and a token w with p(w|z1) = 0.09 and p(w|z2) = 0.01. For a document containing only that one word w, its p(z|…
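For a one-word document the topic posterior can be written in closed form instead of sampled, which sidesteps the randomness entirely: p(z|d) ∝ p(w|z)·p(z). A minimal sketch using the question's numbers, under the assumption of a symmetric (uniform) topic prior:

```python
# Question's topic-word probabilities for the single token w.
p_w_given_z = {"z1": 0.09, "z2": 0.01}
prior = {"z1": 0.5, "z2": 0.5}  # assumed symmetric prior

# Bayes' rule: p(z | d) proportional to p(w | z) * p(z), then normalize.
unnorm = {z: p_w_given_z[z] * prior[z] for z in p_w_given_z}
total = sum(unnorm.values())
p_z_given_d = {z: unnorm[z] / total for z in unnorm}

print({z: round(p, 3) for z, p in p_z_given_d.items()})  # {'z1': 0.9, 'z2': 0.1}
```

For short-but-not-single-word documents, averaging the fold-in estimate over many Gibbs restarts approximates this same posterior.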

I want to get the topic coherence for an LDA model. Let's say I have two LDA models, one trained on a bag of words and the second on a bag of phrases. How can I get the coherence for these two models and then compare them on the basis of coherence?…

I want to create an LDA topic model and am using spaCy to do so, following a tutorial. The error I receive when I try to use spaCy is one I cannot find on Google, so I'm hoping someone here knows what it's about. I'm running this code on Anaconda:

import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint
# Plott…

I'm new to LDA and topic modeling and would like to understand the inference mechanism. I would like to apply LDA to activity recognition. Say that I have defined 10 topics, each a probability distribution over events, for example TOPIC_1 = event1 (0.5), event2 (0.4), event3 (0.0), event4 (0.0), event5 (0.1). I would like to understand which topics are active across a person's day. One day of a person is a sequence of events sampled every minute. What I'm doing to see which topic is active is: 1) select a 1-hour window in the dai…

I am relatively new to MALLET and need to know: are the words in each topic that MALLET produces rank-ordered in some way? If so, what is the ordering? That is, is the 1st word in a topic list the one with the highest distribution across the corpus? Thanks!…

I am trying to extend the LDA model by adding another layer for locations. Is it possible to add another layer in Mallet? If so, which classes should I extend? The generative process I'm trying to model:
1. Choose a region
2. Choose a topic
3. Choose a word
…

I ran the file SimpleLDA.java and got: exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0 at cc.mallet.topics.SimpleLDA.main(SimpleLDA.java:560)…

I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions in mind. First: why do we need to add alpha and gamma every time we sample from the equation below? What if we deleted alpha and gamma from the equation; would it still be possible to get a result? Second: in LDA, we randomly assign a topic to every word in the document, then try to optimize the topics by observing the data. Which part of the equation above relates to posterior inference?…
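The equation referenced in the question is not shown, but the standard collapsed Gibbs sampling update for LDA (with α smoothing the document-topic counts and a word-side smoothing hyperparameter, written β here and γ in some texts) is presumably:

```latex
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\left(n_{d,k}^{-i} + \alpha\right)
\cdot
\frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}
```

where $n_{d,k}^{-i}$ counts tokens in document $d$ assigned to topic $k$, $n_{k,w_i}^{-i}$ counts assignments of word $w_i$ to topic $k$, $n_k^{-i}$ is the topic total, and $V$ is the vocabulary size (all counts excluding token $i$). Dropping α and β would give probability zero to any topic whose counts happen to be zero for this document or word, so the sampler could get stuck; the smoothing terms are exactly where the Dirichlet priors enter, and the equation as a whole is the posterior over $z_i$ that Gibbs sampling draws from.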

I've been experimenting with LDA topic modelling using Gensim. I couldn't find any topic model evaluation facility in Gensim that could report the perplexity of a topic model on held-out evaluation texts and thus facilitate subsequent fine-tuning of LDA parameters (e.g. the number of topics). It would be greatly appreciated if anyone could shed some light on how I can perform topic model evaluation in Gensim. This question has also been posted on MetaOptimize.…

Thanks for reading and taking the time to think about and respond to this. I am using Gensim's wrapper for MALLET (ldamallet.py), and it works like a charm. I need to get the topic proportions for my corpus (over all my documents) and I do not know how to do that. model.alpha is not it, as it is not normalized to 1; besides, alpha contains my Dirichlet parameters, not the topic proportions. Am I correct? Any help is much appreciated.…