I am doing a correlation study. I have a multiple-response question, which can have more than one answer per case. The multiple-response set (q06_*) is a question about the kinds of transport used, and a respondent could have chosen more than one option. How can I compute bivariate correlations between this variable set and another variable (a score)?

Pybossa is a very good crowdsourcing framework that can be used to collect responses on a volunteer basis. It has a statistical analysis tool, Enki, that can be used to analyze the responses to our experiment. Is there a way to export Pybossa results to Enki so that the results can be analyzed in real time?

Given two variables with the same number of observations, the scatter plot appears to show the points following three separate linear regressions. How could you separate them into three groups with different linear fits?

I would like to know whether heuristic approaches in general, and UPGMA or affinity propagation in particular, may give different results in repeated analyses if the groups are not well defined. That is, since heuristic approaches are practical methods that cannot guarantee an optimum, is it possible that each repeated analysis yields a different solution when there is no clear optimum? I would like to confirm whether this may happen for all heuristic approaches. Thanks in advance.

I am using a Naive Bayes algorithm to predict movie ratings as positive or negative. I have been able to rate movies with 81% accuracy. However, I am also trying to assign a 'confidence level' to each rating. I want to be able to tell the user something like "we think that the review is positive with 80% confidence". Can someone help me understand how to calculate a confidence level for our classification result?
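One common starting point for the "confidence" asked about above is the posterior class probability that Naive Bayes already computes. The following is a minimal sketch, assuming a multinomial model with add-one smoothing and a toy training set; `train_nb`, `posterior`, and the example reviews are hypothetical names and data, not the asker's code. Naive Bayes posteriors tend to be pushed toward 0 or 1, so they are usually treated as a rough score unless calibrated on held-out data.

```python
import math
from collections import Counter


def train_nb(docs):
    """Collect counts from (tokens, label) pairs. All names here are illustrative."""
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab


def posterior(tokens, model):
    """Return P(label | tokens) for every label, using add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in tokens:
            if w in vocab:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    # Normalise in log space to avoid underflow on long reviews.
    top = max(log_scores.values())
    unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(unnorm.values())
    return {label: v / z for label, v in unnorm.items()}


# Toy data, assumed purely for illustration.
train = [
    ("great fun movie".split(), "pos"),
    ("great acting".split(), "pos"),
    ("boring bad movie".split(), "neg"),
    ("bad plot".split(), "neg"),
]
model = train_nb(train)
probs = posterior("great fun".split(), model)
print(probs)
```

The value reported to the user would then be the largest entry of `probs`, ideally after checking on held-out reviews how often predictions made with a given probability are actually correct.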

I was wondering how one would calculate pointwise mutual information (PMI) for text classification. To be more exact, I want to classify tweets into categories. I have a dataset of annotated tweets, and for each category I have a dictionary of words that belong to that category. Given this information, how can I calculate the PMI for each category per tweet, in order to classify a tweet into one of these categories?
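For reference, PMI between a word w and a category c is log2( P(w, c) / (P(w) P(c)) ). The sketch below estimates it from tweet-level co-occurrence counts in an annotated set; `pmi_table` and the toy tweets are assumptions made for illustration, and the per-category dictionaries mentioned in the question are not used here. How to aggregate word-level PMI into a score per tweet (for example, summing over the tweet's words) is a further design choice the question leaves open.

```python
import math
from collections import Counter


def pmi_table(tweets):
    """PMI(word, category) from (tokens, category) pairs, counting each word once per tweet."""
    n = len(tweets)
    cat_counts = Counter(cat for _, cat in tweets)
    word_counts = Counter()
    joint_counts = Counter()
    for tokens, cat in tweets:
        for w in set(tokens):
            word_counts[w] += 1
            joint_counts[(w, cat)] += 1
    # Only (word, category) pairs that co-occur at least once get an entry.
    return {
        (w, c): math.log2((k / n) / ((word_counts[w] / n) * (cat_counts[c] / n)))
        for (w, c), k in joint_counts.items()
    }


# Toy annotated tweets, assumed purely for illustration.
tweets = [
    (["goal", "match"], "sports"),
    (["match", "win"], "sports"),
    (["vote", "win"], "politics"),
    (["vote", "law"], "politics"),
]
pmi = pmi_table(tweets)
print(pmi[("goal", "sports")], pmi[("win", "sports")])
```

A word that appears only in one category gets a positive PMI with it, while a word spread evenly across categories gets a PMI near zero.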

What are some of the deciding factors to take into consideration when choosing a similarity index? In what cases is Euclidean distance preferred over Pearson correlation, and vice versa?
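One factor that often decides between the two is whether magnitude should matter. The sketch below, with made-up vectors and hand-written helper functions, shows the contrast: Pearson correlation compares the shape of two profiles and ignores scale and offset, while Euclidean distance compares absolute values.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Illustrative vectors (assumed data).
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]  # same shape as a, ten times the scale
c = [1.5, 1.5, 3.5, 3.5]      # close to a in magnitude, different shape

print("a vs b:", euclidean(a, b), pearson(a, b))
print("a vs c:", euclidean(a, c), pearson(a, c))
```

Here `b` is far from `a` in Euclidean terms but perfectly correlated with it, whereas `c` is close to `a` in Euclidean terms but less correlated. If two users who rate on different scales should count as similar, Pearson fits better; if the absolute values carry meaning, Euclidean distance does.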

I'm new to NLP and text mining and just heard about Zipf's Law. I've somewhat understood its explanation from the Wikipedia page on the topic. Can anyone explain to me, with a simple graph, example, or piece of code, what is happening?
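Zipf's law says that a word's frequency is roughly inversely proportional to its frequency rank, so rank times frequency stays roughly constant across the most common words. A minimal sketch for building the rank/frequency table is given below; `rank_frequency` is a hypothetical helper, and the one-line sample text is far too small to show the law and is there only to make the code runnable. On a large corpus, plotting log(rank) against log(frequency) typically gives something close to a straight line.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    ordered = Counter(words).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]


# Tiny sample (assumed); replace with a full book or corpus to see the pattern.
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
for rank, word, count in table:
    # Under Zipf's law, rank * count is roughly constant for a large corpus.
    print(rank, word, count, rank * count)
```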

Below I've included data from a Pew Research study. What is the method for combining probabilities to reach a composite for, say, an 18-year-old black male?

I am new to machine learning. I ran a test but do not know how to interpret and evaluate the results. Case 1: I first randomly divide the data (data A, about 8000 words) into 10 groups (a1..a10). Within each group, I use 90% of the data to build an n-gram model. This model is then tested on the remaining 10% of the same group. The accuracy is below 10%. The other 9 groups are handled the same way (a model is built for each and tested on the remaining 10% of that group's data). All results are about 10% accuracy. (Is this 10-fold cross-validation?) Case 2: I ...

In WinBUGS, I am specifying a model with a multinomial likelihood function, and I need to make sure that the multinomial probabilities are all between 0 and 1 and sum to 1. Here is the part of the code specifying the likelihood: e[k,i,1:9] ~ dmulti(P[k,i,1:9],n[i,k]) Here, the array P[] specifies the probabilities for the multinomial distribution. These probabilities are to be estimated from my data (the matrix e[]) using multiple linear regressions on a series of fixed and random effects. For instance, here is the multiple linear regression used...

I want to run a mediation analysis to see the effect of exposure to a pollutant (continuous) on cancer type (categorical with 4 levels) via a blood biomarker as the mediator (continuous). So the mediation diagram would be something like this: E -> B -> C. For the mediator, I fit a linear regression: med.fit <- lm(blood_biomarker~exposure+age+sex, data=demographics) but when it comes to the outcome variable, I read in the docs that the only appropriate analysis is multinomial regression, such as: ou...

I'm working on a program that takes in several (<50) high-dimensional points in feature space (1000+ dimensions) and performs hierarchical clustering on them by recursively applying standard k-clustering. My problem is that in any one k-clustering pass, different parts of the high-dimensional representation are redundant. I know this problem falls under the umbrella of feature extraction, selection, or weighting. In general, what does one take into account when selecting a particular feature extraction/selection/weighting algorithm? And ...

The question may be vague, but I will try to word it as well as possible. I came up with a crude algorithm to compute whether a sentence (part of a review snippet) is positive, negative, or neutral (let's call this the EQ of the sentence). For 5 sentences, I have ratings for each sentence on a [-100, 100] scale. The review has to be rated on a [0, 5] scale. The ratings are: (0, 39.88), (1, 73.07), (2, 69.65), (3, 51.43), (4, 76.74). The choice I am struggling with is which method to use to compute the overall rating for the review snippet. I researched a litt...

I've read some papers on non-iid data. From Wikipedia, I know what iid (independent and identically distributed) data is, but I am still confused about non-iid data. I did some research but could not find a clear definition or example of it. Can someone help me with this?