cluster analysis - Evaluation of k-means clustering

I'm doing a project on sentiment analysis of Twitter users right now. I'm using the K-Means algorithm to cluster the tweets into 3 clusters: positive, negative, and neutral. But I'm still confused about how to evaluate my project. Can you recommend a method or algorithm that I should use to evaluate the clusters or the performance of my sentiment analysis? Sorry for my poor English. Thank you...

cluster analysis - Applying vector based clustering algorithms to social network context

I have a social network described as edges in a file. I used graph-based clustering algorithms to find dense parts of the graph. However, there is also vector-based clustering, which I need to apply to the data I have, but I cannot find any context for this. I also have information about each node in the form of features. I think using vectors containing the features of each user makes no sense here. For example, k-Means would calculate the distance between user u1 with feature vector v1 = [f1,f2,f3,..] and user u2 with feature vector v...

cluster analysis - Determining canonical classes with text data

I have a unique problem and I'm not aware of any algorithm that can help me. Maybe someone on here is. I have a dataset compiled from many different sources (teams). One field in particular is called "type". Here are some example values for type: aple, apples, appls, ornge, fruits, orange, orange z, pear, cauliflower, colifower, brocli, brocoli, leeks, veg, vegetables. What I would like to be able to do is group them together into e.g. fruits, vegetables, etc. Put another way, I have multiple spellings of various permutations of a parent lev...
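One possible first step for the spelling variants listed above is fuzzy string matching against a list of known spellings. The following is only a minimal sketch using `difflib.get_close_matches` from the Python standard library. The `CANONICAL` table and the `normalize` helper are hypothetical and not part of the original dataset, and the approach assumes such a list of canonical spellings and their parent classes can be written down by hand.

```python
from difflib import get_close_matches

# Hypothetical lookup table (an assumption, not from the original data):
# canonical spelling -> parent class.
CANONICAL = {
    "apple": "fruits",
    "orange": "fruits",
    "pear": "fruits",
    "cauliflower": "vegetables",
    "broccoli": "vegetables",
    "leeks": "vegetables",
}


def normalize(raw, cutoff=0.6):
    """Return (canonical spelling, parent class) for a raw value,
    or (None, None) if no canonical spelling is similar enough."""
    matches = get_close_matches(raw.strip().lower(), list(CANONICAL), n=1, cutoff=cutoff)
    if not matches:
        return None, None
    return matches[0], CANONICAL[matches[0]]


if __name__ == "__main__":
    for value in ["aple", "appls", "ornge", "colifower", "brocli"]:
        print(value, "->", normalize(value))
```

Values that are already class names (such as "veg" or "fruits") would need separate handling, and the `cutoff` value would probably have to be tuned on real data.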

cluster analysis - ELKI Kmeans clustering Task failed error for high dimensional data

I have 60000 documents which I processed in gensim to get a 60000*300 matrix. I exported this as a CSV file. When I import it into the ELKI environment and run k-means clustering, I get the error below.
Task No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(
at de.lmu.ifi.dbs.elki.algorith...
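One reading of `mindim=266,maxdim=300` in the message above is that not every row of the CSV was parsed into 300 numbers. That is an assumption, not a confirmed diagnosis. A minimal sketch for checking it counts the numeric fields per row; the helper name `numeric_field_counts` and the file name in the comment are hypothetical.

```python
import csv
from collections import Counter


def numeric_field_counts(lines):
    """Map 'number of fields that parse as float' -> 'number of rows'."""
    counts = Counter()
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        numeric = 0
        for field in row:
            try:
                float(field)
                numeric += 1
            except ValueError:
                pass  # empty or non-numeric field
        counts[numeric] += 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory stand-in for the exported file. With a real file:
    #   with open("vectors.csv", newline="") as f:   # hypothetical file name
    #       print(numeric_field_counts(f))
    sample = ["0.1,0.2,0.3", "0.4,,0.6", "0.7,0.8,0.9"]
    print(numeric_field_counts(sample))
```

If the result contains more than one key, the rows do not all have the same number of numeric values, which would be worth fixing in the export before clustering.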

cluster analysis - doc2vec clustering n*n similarity between documents

I have a set of document vectors generated using gensim doc2vec (~500K vectors of 150 dimensions). I wish to cluster similar documents, for which I want to generate an n*n similarity matrix over which I can run my clustering algorithm. I tried the instructions at this link using gensim.similarities, but the output for 500k records was a 500k*150 matrix. I don't understand the output. Shouldn't it be 500k * 500k? Am I missing something?...
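For reference, a full pairwise similarity matrix over n vectors is n by n whatever the vector dimensionality is. The sketch below computes cosine similarity in plain Python on a toy input; it is an illustration of the shape, not the gensim API, and the function names are hypothetical. It also includes the size arithmetic: a dense 500k * 500k matrix has 2.5 * 10^11 entries, about 1 TB at 4 bytes per entry.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the n x n matrix of pairwise cosine similarities."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            denom = norms[i] * norms[j]
            value = dot / denom if denom else 0.0  # zero vectors get similarity 0
            sim[i][j] = sim[j][i] = value  # the matrix is symmetric
    return sim


def dense_matrix_bytes(n, bytes_per_entry=4):
    """Memory needed to store a dense n x n matrix."""
    return n * n * bytes_per_entry


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy vectors of dimension 2
    sim = cosine_similarity_matrix(docs)
    print(len(sim), "x", len(sim[0]))
    print(dense_matrix_bytes(500_000))
```

At this scale, clustering methods that work on the vectors directly, or on nearest neighbours only, may be more practical than a dense similarity matrix.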

cluster analysis - After loading a pretrained Word2Vec model, how do I get word2vec representations of new sentences?

I loaded a word2vec model trained on the Google News dataset. Now I want to get the Word2Vec representations of a list of sentences that I wish to cluster. After going through the documentation I found gensim.models.word2vec.LineSentence, but I'm not sure this is what I am looking for. There should be a way to get word2vec representations of a list of sentences from a pretrained model, right? None of the links I searched had anything about it. Any leads would be appreciated...
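One common approach (not the only one) is to average the vectors of the words in each sentence. The sketch below uses a toy dictionary in place of the pretrained model, so the vectors and the `sentence_vector` helper are hypothetical. With a loaded gensim `KeyedVectors` object, the same membership test (`word in kv`) and lookup (`kv[word]`) should be usable in place of the dictionary.

```python
def sentence_vector(sentence, word_vectors, dim):
    """Average the vectors of the in-vocabulary words of a sentence.

    Words missing from `word_vectors` are skipped; a sentence with no
    known words maps to the zero vector.
    """
    known = [word_vectors[tok] for tok in sentence.split() if tok in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


if __name__ == "__main__":
    # Toy 2-dimensional "model" (assumption; real Google News vectors have 300 dims).
    toy_model = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
    sentences = ["good movie", "good unknownword", "nothing known here"]
    for s in sentences:
        print(s, "->", sentence_vector(s, toy_model, dim=2))
```

The resulting fixed-length vectors can then be passed to a clustering algorithm. Tokenization here is a plain whitespace split, which is a simplification.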

cluster analysis - Which machine learning library to use

I am looking for a library that, ideally, has the following features: implements hierarchical clustering of multidimensional data (ideally on a similarity or distance matrix); implements support vector machines; is in C++; is somewhat documented (this one seems to be hardest). I would like this to be in C++, as I am most comfortable with that language, but I will also use any other language if the library is worth it. I have googled and found some, but I do not really have the time to try them all out, so I want to hear what experiences other people have had...

Cluster assignment remapping

I have labelled test classification datasets from the UCI Machine Learning repository. I am stripping off the labels and using the data to benchmark a few clustering algorithms, and then I am planning to use external validation methods. I will run the algorithm with different initial configurations, say 50 times, and then take the mean value. Across the 50 iterations the algorithm labels the data points of one single cluster with different numbers. Because in each run the cluster labels can change, and also because each iteration might have slightly...
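The label-switching issue described above can be handled by relabelling each run so that it agrees with a reference labelling as much as possible. The following is a minimal sketch, assuming the number of clusters is small enough to try every permutation; `remap_labels` is a hypothetical helper, not a library function.

```python
from itertools import permutations


def remap_labels(reference, labels):
    """Rename cluster ids in `labels` to best agree with `reference`.

    Tries every one-to-one assignment of cluster ids to reference ids
    (brute force, so only practical for a handful of clusters).
    Returns (remapped labels, fraction of points that agree).
    """
    ref_ids = sorted(set(reference))
    lab_ids = sorted(set(labels))
    if len(lab_ids) > len(ref_ids):
        raise ValueError("more clusters than reference classes")
    best_map, best_hits = None, -1
    for perm in permutations(ref_ids, len(lab_ids)):
        mapping = dict(zip(lab_ids, perm))
        hits = sum(1 for r, l in zip(reference, labels) if mapping[l] == r)
        if hits > best_hits:
            best_map, best_hits = mapping, hits
    return [best_map[l] for l in labels], best_hits / len(labels)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    run = [2, 2, 0, 0, 1, 1]  # same partition, different cluster numbers
    print(remap_labels(truth, run))
```

For larger numbers of clusters, an assignment solver such as `scipy.optimize.linear_sum_assignment` on the contingency table avoids the factorial search. Permutation-invariant measures such as the adjusted Rand index avoid the remapping step altogether.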

data science - Visualizing clusters using TSNE

I have a dataset which I need to cluster and display in such a way that elements in the same cluster appear closer together. The dataset comes from a research study and has around 16 rows (entries) and about 50 features. I agree that it's not an ideal dataset to begin with, but unfortunately that is the situation at hand. The following is the approach I took: I first applied KMeans to the dataset after normalizing it. In parallel, I also tried using TSNE to map the data into 2 dimensions and plotted the result on a scatterplot. From my understa...

data visualization - How do I visualise clusters of users?

I have an application in which users interact with each other. I want to visualize these interactions so that I can determine whether clusters of users exist (within which interactions are more frequent). I've assigned a 2D point to each user (where each coordinate is between 0 and 1). My idea is that two users' points move closer together when they interact (an "attractive force"), and I just go through my interaction logs over and over again. Of course, I need a "repulsive force" that will push users apart too, otherwise they will a...

cluster analysis - Consensus Clustering Plus in R

I have a dilemma in deciding on an optimal number of clusters for my dataset. I have obtained ConsensusClusterPlus heatmap plots but am not really sure how to decide on the optimal number of clusters. Can anyone help with this? I've attached pictures of heatmaps and plots to decide on the optimal number of clusters. It seems like k=3 is an optimal number (see pics: pic 6, pic 4, pic 2, pic 3, pic 7, pic 8, pic 9, pic 10)...