I'm working on a project about sentiment analysis of Twitter users. I'm using the K-Means algorithm to cluster the tweets into 3 clusters: positive, negative and neutral. But I'm still unsure how to evaluate the results. Can you recommend a method or algorithm for evaluating the clusters or the performance of my sentiment analysis? Sorry for my poor English. Thank you...

I have a data set with nominal, ordinal and metric variables. I want to perform a cluster analysis; since I have mixed scales, it seems that k-modes clustering is the most appropriate way to explore the data. Or does anyone have a better way in mind? I am thankful for any advice!...

I have a social network described as edges in a file. I used graph-based clustering algorithms to find dense parts of the graph. However, there is also vector-based clustering, which I need to apply to my data, but I cannot find any context for this. I also have information about each node's features. I think using vectors containing the features of each user makes no sense here. For example, k-Means would calculate the distance between user u1 with feature vector v1 = [f1,f2,f3,..] and user u2 with feature vector v...

I have a unique problem and I'm not aware of any algorithm that can help me. Maybe someone on here is. I have a dataset compiled from many different sources (teams). One field in particular is called "type". Here are some example values for type: aple, apples, appls, ornge, fruits, orange, orange z, pear, cauliflower, colifower, brocli, brocoli, leeks, veg, vegetables. What I would like to do is group them together into e.g. fruits, vegetables, etc. Put another way, I have multiple spellings of various permutations of a parent lev...
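
One possible starting point for the spelling variants, as a rough sketch only: fuzzy-match each raw value against a hand-maintained list of canonical names with Python's standard `difflib`, then look the canonical name up in a category table. The `CANONICAL` table, the function names and the 0.6 cutoff are all assumptions made for illustration; they are not from the question, and the cutoff would need tuning on real data.

```python
import difflib

# Hypothetical lookup table: canonical spelling -> parent category.
CANONICAL = {
    "apple": "fruits",
    "orange": "fruits",
    "pear": "fruits",
    "cauliflower": "vegetables",
    "broccoli": "vegetables",
    "leek": "vegetables",
}


def normalize(value, cutoff=0.6):
    """Return the closest canonical spelling, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        value.lower(), list(CANONICAL), n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


def categorize(value):
    """Map a raw 'type' value to its parent category via the canonical spelling."""
    name = normalize(value)
    return CANONICAL[name] if name else None


for raw in ["aple", "ornge", "brocli"]:
    print(raw, "->", normalize(raw), "->", categorize(raw))
```

This only handles misspellings of names that are already in the table. Values that are themselves parent-level labels (such as "veg") would still need their own entries or a separate rule.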

I have 60000 documents which I processed in gensim to get a 60000*300 matrix. I exported this as a CSV file. When I import it into the ELKI environment and run K-Means clustering, I get the error below.
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=266,maxdim=300 LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorith...

I have a set of document vectors generated using gensim doc2vec (~500K vectors of 150 dimensions). I wish to cluster similar documents, so I want to generate an n*n similarity matrix over which I can run my clustering algorithm. I tried the instructions at this link https://github.com/RaRe-Technologies/gensim/issues/140 using gensim.similarities, but the output for 500k records was a 500k*150 matrix. I don't understand the output. Shouldn't it be 500k * 500k? Am I missing something?...
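
For reference, the shape the question expects can be seen on a toy example: n vectors of d dimensions give an n*n matrix of pairwise cosine similarities. This is a plain-Python sketch with made-up vectors, not the gensim API, and the function name is an assumption. Note that a dense 500k * 500k matrix has about 2.5 * 10^11 entries, roughly 1 TB at 4 bytes per entry, so the full matrix is unlikely to fit in memory at that scale.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the n x n matrix of pairwise cosine similarities."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            denom = norms[i] * norms[j]
            # Treat similarity with an all-zero vector as 0.0.
            value = dot / denom if denom else 0.0
            sim[i][j] = sim[j][i] = value  # the matrix is symmetric
    return sim


# Four toy "document vectors" of dimension 3 (illustrative values only).
docs = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
sim = cosine_similarity_matrix(docs)
print(len(sim), "x", len(sim[0]))
```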

I loaded a word2vec model trained on the Google News dataset. Now I want to get the Word2Vec representations of a list of sentences that I wish to cluster. After going through the documentation I found gensim.models.word2vec.LineSentence, but I'm not sure this is what I am looking for. There should be a way to get word2vec representations of a list of sentences from a pretrained model, right? None of the links I found had anything about it. Any leads would be appreciated....
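
One common baseline, offered here as a sketch and not as the gensim-recommended way, is to average the vectors of the words in each sentence. The code below uses a tiny hypothetical dict (`toy_vectors`) as a stand-in for the pretrained lookup so that it runs on its own; with a loaded gensim model the same `word in vectors` / `vectors[word]` pattern should apply, but treat that as an assumption to check against your gensim version.

```python
def sentence_vector(sentence, word_vectors, dim):
    """Average the vectors of the known words; all zeros if none are known."""
    tokens = sentence.split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) walks the vectors column by column.
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical 2-dimensional stand-in for a pretrained word -> vector lookup.
toy_vectors = {"cats": [1.0, 0.0], "purr": [0.0, 1.0], "dogs": [1.0, 1.0]}

sentences = ["cats purr", "dogs bark loudly"]
sentence_matrix = [sentence_vector(s, toy_vectors, dim=2) for s in sentences]
print(sentence_matrix)
```

Averaging discards word order, so it is best seen as a quick baseline for clustering.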

I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?...

I am looking for a library that, ideally, has the following features:
- implements hierarchical clustering of multidimensional data (ideally on a similarity or distance matrix)
- implements support vector machines
- is in C++
- is somewhat documented (this one seems to be the hardest)
I would like this to be in C++, as I am most comfortable with that language, but I will also use any other language if the library is worth it. I have googled and found some, but I do not really have the time to try them all out, so I want to hear what experiences other people have had...

I have labelled test classification datasets from the UCI Machine Learning repository. I am stripping off the labels and using the data to benchmark a few clustering algorithms, and then I am planning to use external validation methods. I will run the algorithm with different initial configurations, say 50 times, and then take the mean value. Over the 50 iterations the algorithm labels the data points of one single cluster with different numbers. Because in each run the cluster labels can change, also because each iteration might have slightly...
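
For the label-numbering issue, pair-counting external measures such as the adjusted Rand index depend only on which points are grouped together, not on the numbers assigned to the clusters, so they can be computed per run and then averaged. Below is a minimal pure-Python sketch of the adjusted Rand index; the function name is an assumption, and a library implementation would normally be preferred in practice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Count pairs of points that share a cluster in both, in A, and in B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (both - expected) / (max_index - expected)


truth = [0, 0, 1, 1]
run_1 = [1, 1, 0, 0]  # same grouping as truth, cluster numbers swapped
run_2 = [0, 1, 0, 1]  # a different grouping
scores = [adjusted_rand_index(truth, run) for run in (run_1, run_2)]
print(scores, sum(scores) / len(scores))
```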

I have a dataset which I need to cluster and display so that elements in the same cluster appear closer together. The dataset comes from a research study and has around 16 rows (entries) and about 50 features. I agree that it's not an ideal dataset to begin with, but unfortunately that is the situation at hand. The following is the approach I took: I first applied KMeans on the dataset after normalizing it. In parallel I also tried to use TSNE to map the data into 2 dimensions and plotted them on a scatterplot. From my understa...

I have an application in which users interact with each other. I want to visualize these interactions so that I can determine whether clusters of users exist (within which interactions are more frequent). I've assigned a 2D point to each user (where each coordinate is between 0 and 1). My idea is that two users' points move closer together when they interact, an "attractive force", and I just repeatedly go through my interaction logs over and over again. Of course, I need a "repulsive force" that will push users apart too, otherwise they will a...

There are a lot of validity indices for clustering, but only for numeric data. What about clustering of mixed data (numeric and categorical)?...

I have a dataset as a table with 15 columns. I would like to use cluster analysis to gain new knowledge, but I do not know which parameters to use in the analysis. I understand that I should determine the parameters that most affect the data. Which statistical methods should I apply to my data? Any examples, ideas, books, methods or procedures would be welcome...

I've got a dilemma in deciding on the optimal number of clusters for my dataset. I have obtained ConsensusClusterPlus heatmap plots but am not really sure how to decide on the optimal number of clusters. Can anyone help with this? I've attached pictures of the heatmaps and plots used to decide on the optimal number of clusters. It seems like k=3 is an optimal number...