gensim(1.0.1) Doc2Vec with google pretrained vectors

For gensim (1.0.1) Doc2Vec, I am trying to load the Google pre-trained word vectors instead of using `Doc2Vec.build_vocab`:

```python
wordVec_google = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model0 = Doc2Vec(size=300, alpha=0.05, min_alpha=0.05, window=8, min_count=5, workers=4, dm=0, hs=1)
model0.wv = wordVec_google
# some other code
model0.build_vocab(sentences=allEmails, max_vocab_size=20000)
```

But this `model0` object cannot be further trained with "labeled Docs", and can't infer vectors for d...Read more

gensim doc2vec train more documents from pre-trained model

I am trying to train a pre-trained model with new labelled documents (TaggedDocument). The pre-trained model was trained on documents whose unique ids carry a label1 index, for instance Good_0, Good_1, ... Good_999, and the total size of that training data is about 7000. Now I want to train the pre-trained model with new documents whose unique ids carry a label2 index, for instance Bad_0, Bad_1, ... Bad_1211, and the total size of that training data is about 1211. The training itself succeeded without any error, but the problem is that whenever I try...Read more

gensim - Is it possible to search for part of the text using word embeddings?

I have found a successful weighting scheme for adding word vectors which seems to work for sentence comparison in my case:

```python
query1 = vectorize_query("human cat interaction")
query2 = vectorize_query("people and cats talk")
query3 = vectorize_query("monks predicted frost")
query4 = vectorize_query("man found his feline in the woods")

>>> print(1 - spatial.distance.cosine(query1, query2))
0.7154500319
>>> print(1 - spatial.distance.cosine(query1, query3))
0.415183904078
>>> print(1 - spatial.distance.cosi
```
...Read more

gensim - Combining Doc2Vec sentences into paragraph vectors

In Gensim's Doc2Vec, how do you combine sentence vectors to make a single vector for a paragraph? I realise you can train on the entire paragraph, but it would obviously be better to train on individual sentences, for context, etc. (I think...?) Any advice or a normal use case? Also, how would I retrieve sentence/paragraph vectors from the model?...Read more

doc2vec - Problems accessing docvectors with gensim

I'm trying to use gensim's (ver 1.0.1) Doc2Vec to get the cosine similarities of documents. This should be relatively simple, but I'm having problems retrieving the vectors of the documents so I can do cosine similarity. When I try to retrieve a document by the label I gave it in training, I get a key error. For example, `print(model.docvecs['4_99.txt'])` will tell me that there is no such key as `4_99.txt`. However, if I `print(model.docvecs.doctags)` I see things like this: `'4_99.txt_3': Doctag(offset=1644, word_count=12, doc_count=1)`. So it appear...Read more

gensim - Extracting vectors from Doc2Vec

I am trying to extract the document vectors to feed into a regression model for prediction. I have fed around 1 400 000 labelled sentences into Doc2Vec for training, however I was only able to retrieve 10 vectors using `model.docvecs`. This is a snapshot of the labelled sentences I used to train the doc2vec model:

```python
In : documents[0]
Out: TaggedDocument(words=['descript', 'yet'], tags='0')

In : documents[-1]
Out: TaggedDocument(words=['new', 'tag', 'red', 'sparkl', 'firm', 'price', 'free', 'ship'], tags='1482534')
```

This is the code used to trai...Read more

How to get the topic-word probabilities of a given word in gensim LDA?

As I understand it, if I'm training an LDA model over a corpus where the size of the dictionary is, say, 1000 and the number of topics (K) = 10, then for each word in the dictionary I should have a vector of size 10 where each position in the vector is the probability that the word belongs to that particular topic, right? So my question is: given a word, what is the probability that the word belongs to topic k, where k could be from 1 to 10? How do I get this value in the gensim LDA model? I was using the get_term_topics method but it doesn't output all the probabiliti...Read more

Updating training documents for gensim Doc2Vec model

I have an existing gensim Doc2Vec model, and I'm trying to do iterative updates to the training set, and by extension, the model. I take the new documents and perform preprocessing as normal:

```python
stoplist = nltk.corpus.stopwords.words('english')
train_corpus = []
for i, document in enumerate(corpus_update['body'].values.tolist()):
    train_corpus.append(gensim.models.doc2vec.TaggedDocument(
        [word for word in gensim.utils.simple_preprocess(document) if word not in stoplist], [i]))
```

I then load the original model, update the vocabulary, and retrain:

#### Or...Read more

Update gensim word2vec model

I have a word2vec model in gensim trained over 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set over which I trained the model), I need to update the model with that sentence so that querying it next time gives some results. I am doing it like this:

```python
new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)
```

and it prints these logs:

```
2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of
```
...Read more

gensim - Finding concepts from a large corpus using Word embeddings

I am trying to find new concepts in a corpus of the Konkani language. I trained two models, on 1) a domain-specific corpus and 2) a newspaper corpus. I used Gensim word2vec to train the models, however I am unable to get terms of similar meaning in close proximity in the vector space. The closest words show no relation of being synonyms with each other; their similarity is as good as just some random words. What am I doing wrong?...Read more

gensim - Word2Vec: Effect of window size used

I am trying to train a word2vec model on very short phrases (5-grams). Since each sentence or example is very short, I believe the window size I can use is at most 2. I am trying to understand the implications of such a small window size for the quality of the learned model, so that I can tell whether my model has learnt something meaningful or not. I tried training a word2vec model on 5-grams, but it appears the learnt model does not capture semantics very well. I am using the following test to evaluate the accuracy of model...Read more

gensim - relationship between window size and the actual sentence length in word2vec

I am trying to run word2vec (the skip-gram model implemented in gensim, with a default window size of 5) on a corpus of .txt files. The iterator that I use looks something like this:

```python
class Corpus(object):
    """Iterator for feeding sentences to word2vec"""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        word_tokenizer = TreebankWordTokenizer()
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        text = ''
        for root, dirs, files in os.walk(self.dirname):
            for file
```
...Read more
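The pasted iterator is cut off; a complete version along the same lines (walk a directory, sentence-split each file, yield one token list per sentence) might look like the following. To stay self-contained, this sketch substitutes naive punctuation splitting and `str.split` for the NLTK tokenizers used in the question:

```python
import os

class Corpus(object):
    """Iterate over .txt files under a directory, yielding one token list
    per sentence. (Naive stdlib tokenization stands in for NLTK here.)"""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for root, dirs, files in os.walk(self.dirname):
            for fname in files:
                if not fname.endswith(".txt"):
                    continue
                with open(os.path.join(root, fname)) as fh:
                    text = fh.read()
                # Crude sentence split on ./?/!; NLTK's punkt does this properly.
                for sentence in text.replace("?", ".").replace("!", ".").split("."):
                    tokens = sentence.lower().split()
                    if tokens:
                        yield tokens

# Word2Vec can consume the iterator directly (and re-iterate it each epoch):
# model = Word2Vec(Corpus("/path/to/txt/dir"), vector_size=100, window=5, min_count=5)
```

Yielding sentence-by-sentence matters for the window question above: gensim never lets a context window cross a yielded sentence boundary.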

text classification - What is the difference between gensim LabeledSentence and TaggedDocument

Please help me understand the difference between how gensim's TaggedDocument and LabeledSentence work. My ultimate goal is text classification using a Doc2Vec model and any classifier. I am following this blog!

```python
class MyLabeledSentences(object):
    def __init__(self, dirname, dataDct={}, sentList=[]):
        self.dirname = dirname
        self.dataDct = {}
        self.sentList = []

    def ToArray(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as fin:
```
...Read more