### vowpalwabbit - Categorical Features in Vowpal Wabbit

This link says that currently all feature labels must be followed by a float. But when I enter -1 3 |context day:Monday in this validator, it accepts day as a feature with value Monday. Further, if I can provide strings as values to a feature, how can I provide values that contain spaces? For example, -1 3 |context day:Monday name:A B keeps only A as the value of the label name and treats B as another label. But I actually want to assign the label name the value "A B"...

### min samples per categorical value in catboost

How can I tell CatBoost to group together categorical values with few samples? For example, let's say I have a column called Country which has only 1 sample for 'Cambodia' and 2 samples for 'Mongolia' and 999,998 other countries, each of which has at least 100 samples. I would like to tell CatBoost not to bother doing its CTR magic on those rare countries but just treat those as "other"...
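One way to get the "other" bucket described above is to collapse rare values before the column ever reaches the model. The following is a minimal preprocessing sketch in plain Python, not a CatBoost option; the function name `collapse_rare`, the `min_samples` threshold, and the toy country column are assumptions made for illustration.

```python
from collections import Counter


def collapse_rare(values, min_samples, other_label="other"):
    """Replace categories seen fewer than min_samples times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_samples else other_label for v in values]


# Hypothetical toy column: two frequent countries plus the two rare ones
# mentioned in the question (1 sample and 2 samples).
countries = ["France"] * 100 + ["Japan"] * 150 + ["Cambodia"] + ["Mongolia"] * 2

collapsed = collapse_rare(countries, min_samples=100)
collapsed_counts = Counter(collapsed)
print(collapsed_counts)
```

The collapsed column can then be passed to the model as an ordinary categorical feature. The threshold should be derived from the training data only and reused unchanged on validation and test data.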

### dataset - Sorting and merging in Stata on categorical variables

I am in the process of merging two data sets together in Stata and came up with a potential concern. I am planning on sorting each data set in exactly the same manner on several categorical variables that are common to both sets of data. However, several of the categorical variables have more categories present in one data set than in the other. I have been careful to ensure that the coding matches up in both data sets (e.g. Red is coded as 1 in both data set A and B, but data set A has only Red, Green and Blue whereas data set B has Red, G...

### Categorical variable in WinBUGS, OpenBUGS

I am using a script from the following paper (Zipkin, E.F., Royle, J.A., Dawson, D.K., Bates, S., 2010. Multi-species occurrence models to evaluate the effects of conservation and management actions. Biological Conservation 143, 479-484) to estimate bird species occupancy. One of my variables in the detection estimate (the K loop in the code below) is Wind, which is a categorical variable with levels 1-6. I have attempted to use the dcat function in OpenBUGS with what I hope is an uninformative prior (beta(1,1)), but OpenBUGS fails with error:...

### pca - Can principal component analysis be applied to datasets containing a mix of continuous and categorical variables?

I have a dataset that has both continuous and categorical data. I am analyzing by using PCA and am wondering if it is fine to include the categorical variables as a part of the analysis. My understanding is that PCA can only be applied to continuous variables. Is that correct? If it cannot be used for categorical data, what alternatives exist for their analysis?...

### categorical data - Interpreting the entropy of a Dirichlet distribution

I was looking for a measure to interpret the "spikiness" of categorical histograms. So, if a histogram becomes unnaturally skewed towards a certain value at a given time, I want a metric that will show some kind of spike at that time. I considered a variety of metrics for this purpose and finally settled on the entropy of a Dirichlet distribution (considering the histogram of counts as a sample from a Dirichlet and using the corresponding entropy as my metric). For this, I used the formula for entropy in the Wikipedia article: H = logB(\alpha) + (\...

### Convert numerical variable to categorical and group

I have a variable with patients' ages. I have 180 values with ages ranging from 18 to 92 years. I want to use this variable as a factor with three levels: a: ages from 18-57; b: ages from 58-68; c: ages from 69-92. I typed: AGE.factor=cut(AGE, breaks=c(18:57,58:68,69:92)) but the response I get is: str(AGE.factor) Factor w/ 74 levels "(18,19]","(19,20]",..: 44 44 44 44 44 44 50 50 50 28 ... Why did that happen? I only want 3 levels of my variable with the ages grouped. Thanks...
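The 74 levels follow from the break points: in R, 18:57 expands to every integer from 18 to 57, so c(18:57,58:68,69:92) contains 75 distinct breaks, which define 74 intervals. Three groups need only four break points (in R, something like breaks=c(17,57,68,92) with the default right-closed intervals should give the three intended groups). The sketch below reproduces the count and the intended grouping in Python; the names `age_group`, `LOWER_BOUNDS`, and `LABELS` are illustrative assumptions, not part of the original question.

```python
from bisect import bisect_right

# Every integer from 18 to 92, as produced by c(18:57, 58:68, 69:92) in R.
all_breaks = list(range(18, 58)) + list(range(58, 69)) + list(range(69, 93))
n_intervals = len(all_breaks) - 1  # adjacent break points define one interval each

# Lower bounds of groups b and c; ages below 58 fall into group a.
LOWER_BOUNDS = [58, 69]
LABELS = ["a", "b", "c"]


def age_group(age):
    """Map an age in 18..92 to one of the three groups a, b, c."""
    if not 18 <= age <= 92:
        raise ValueError(f"age {age} is outside the range 18-92")
    return LABELS[bisect_right(LOWER_BOUNDS, age)]


print(n_intervals)
print([age_group(a) for a in (18, 57, 58, 68, 69, 92)])
```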

### categorical data - Repeated-measure contingency table with two variables with many levels

I am trying to analyze the results of a survey on what would make people of different ages visit certain tourist places. However, age was collected as age group (not my fault). Hence, it is a categorical variable. I have two variables: AGE GROUP (5 levels) and PREFERENCE (8 levels). However, participants could indicate more than one preference. Hence, PREFERENCE is a repeated-measure variable (i.e. one participant could contribute to two values, e.g. PREFERENCE1 and PREFERENCE2). My table looks as below. I thought I could use a contingency table to an...

### gam - Which level of factor (categorical variable) is assigned "the parameter value"?

Here is an example from Zuur's book about GAM. The data file (Bailey fisheries data) and R code (Chapter2.R) can be found on his website. Below I describe what confuses me. We want to examine whether the fish density/depth relationship is different in the two time periods. The two periods are: (1) 1979 ~ 1989; (2) 1997 ~ 2002. The GAM model is: $Dens_{i} = \alpha + f(Depth_{i}) + \beta_{1}*Period_{i} + \epsilon_{i}$ where $Dens_{i}$ is fish density at site i and $Depth_{i}$ is the depth at site i. The R code is: library(mgcv) DF <- structure(list(Dens = c(0...

### categorical data - Check if dropout rates are independent for an interaction of two independent variables (one with a large amount of levels)

I am trying to analyse dropout rates in an experiment, but there are multiple issues which collide, and I don't know how to deal with them as a whole. First, find a list of those issues. Below, see a more elaborate description and a fictional example for illustration (which involves cakes!). I am using R for my analysis, but appreciate any theoretical input. Issues: a combination of two independent variables (stimulus $\times$ condition) ... where one of the variables has a large number of levels (100 stimuli) ... where there might be (?) dependen...

### categorical data - How to compare frequencies among groups?

I want to compare the frequency of a categorical value among 4 groups. What statistical test should I use? (I am using SPSS.) I am looking for a statistical test that would allow me to say: the frequency of value "V" depends on the group, and the groups' frequencies are statistically different for that value. Here is an example: Group 1 shows value "V" 10 times, Group 2 shows value "V" 15 times... Group 4 shows value "V" 40 times. Thanks a lot in advance...

### categorical data - compare subjective participants choices

I am categorising photographs into one of three categories based on a subjective appraisal of the photo contents. I wanted to test how 'subjective' the method is by asking 7 other people to classify 100 images and compare their decisions to mine. 71% of the images received 'mostly' the same category, i.e. at least 5 people out of the 7 agree on the category. I personally don't think this is high enough agreement to rely on this method, but is there a way of testing this statistically?...
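One commonly used statistic for agreement among several raters on categorical labels is Fleiss' kappa, which compares the observed agreement with the agreement expected by chance. The sketch below is a plain-Python illustration of that statistic, offered as one possible approach rather than the answer to the question; the function name `fleiss_kappa` and the three toy items are assumptions, with each row giving how many of 7 raters chose each of the three categories for one image.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table of per-item category counts.

    table[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: share of rater pairs that agree, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    n_categories = len(table[0])
    shares = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(s * s for s in shares)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")

    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical toy data: 3 images, 7 raters, 3 categories.
toy_table = [[7, 0, 0], [0, 7, 0], [5, 2, 0]]
kappa = fleiss_kappa(toy_table)
print(round(kappa, 3))
```

A value of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance. With the real data, the table would have 100 rows, one per image.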

### categorical data - What does it mean to have "groups" and "levels" of variables?

Using this website, a user can find a correct statistical test for their project. However, what do they mean when they write "2+ groups" or "2+ levels"?...

### Visualize categorical data for two variables for 20 observations

I have several thousand records. In these records, there are three fields of interest: Location (20 possible choices); Race (5 possible choices); Purchase type (3 possible choices). I want to show the count of the possible combinations for each location on a single graphic. For example, Location 1 will have 15 possible choices (three purchase types for each of the five race choices) and the first choice might be a count of 23 observations for (1) Citytown (2) Black (3) New Purchase. The next data point could be 14 observations of (1) Citytown (2) Blac...
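Before choosing a graphic, the counts per combination have to be tabulated. The sketch below shows one way to do that in plain Python; the record tuples, the helper name `counts_for_location`, and every label other than Citytown, Black, and New Purchase are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical records as (location, race, purchase_type) tuples.
records = [
    ("Citytown", "Black", "New Purchase"),
    ("Citytown", "Black", "New Purchase"),
    ("Citytown", "Black", "Refinance"),
    ("Citytown", "White", "New Purchase"),
    ("Otherville", "Black", "New Purchase"),
]

# One count per (location, race, purchase_type) combination.
combo_counts = Counter(records)


def counts_for_location(counts, location):
    """Return {(race, purchase_type): count} for a single location."""
    return {
        (race, purchase): n
        for (loc, race, purchase), n in counts.items()
        if loc == location
    }


citytown = counts_for_location(combo_counts, "Citytown")
print(citytown)
```

Each location's dictionary holds at most 15 entries (5 races times 3 purchase types), which maps naturally onto one panel or one row of a grid per location.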

### categorical data - Features that correspond to rare events: how rare is "too rare" to be informative?

I am working with 82 binary features constructed from six categorical features. I have about 1,600 observations. Some of these features correspond to extremely rare categories. Some of them have only one or two 1's in the entire sample; others have counts in the teens, and a few others in the 20s and 30s. There is a lot of natural variation in the data, so these very rare 1's are not likely to be informative. But how rare is "too rare"? The categories that appear only twice in 1,600 observations are probably too rare. But is 20 appearances still ...