As you point out, there are two ways to encode categorical variables. One
way uses (n-1) binary variables for a categorical variable with n possible
values (contrast coding). The other uses n binary variables (direct, or
1-of-n, coding).
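For concreteness, here is a minimal sketch of the two encodings in Python
(the level names are made up for illustration):

```python
# Hypothetical categorical variable with n = 3 levels.
levels = ["red", "green", "blue"]

def one_hot(value, levels):
    """Direct (1-of-n) encoding: one indicator per level."""
    return [1 if value == lv else 0 for lv in levels]

def contrast(value, levels):
    """Contrast (1-of-(n-1)) encoding: the first level is the
    reference level and is represented by all zeros."""
    return [1 if value == lv else 0 for lv in levels[1:]]

print(one_hot("green", levels))   # [0, 1, 0]
print(contrast("green", levels))  # [1, 0]
print(contrast("red", levels))    # [0, 0]  (reference level)
```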
The singularity problem that you mention definitely occurs. It comes from
the fact that we now have (n+1) parameters with, effectively, only n
independent constraints from the data. With no other information, the
problem is underdetermined, which leads to a singular system in the
numerical solution if you use a second-order method, or to unbounded
wandering of the coefficients if you are using stochastic gradient descent.
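To see the underdetermination concretely, here is a small sketch (pure
Python, with made-up coefficient values) showing that with 1-of-n encoding
the coefficients are not identifiable: you can shift every level coefficient
by an arbitrary constant, absorb it into the intercept, and get exactly the
same predictions.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, intercept):
    """x is a 1-of-n indicator vector; exactly one entry is 1."""
    return logistic(sum(w * xi for w, xi in zip(weights, x)) + intercept)

weights, intercept = [0.5, -1.2, 2.0], 0.3   # made-up values
c = 7.0                                      # arbitrary shift
shifted = [w + c for w in weights]           # add c to every coefficient...

for x in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    p1 = predict(x, weights, intercept)
    p2 = predict(x, shifted, intercept - c)  # ...and subtract it here
    # Predictions agree, so the data cannot pin down a unique solution.
    assert abs(p1 - p2) < 1e-9
```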
In large problems, however, having lots of variables for only limited
amounts of data is pretty ubiquitous. In fact, it is common to have more
variables than observations, possibly vastly more, and many of these
variables are often essentially restatements of other variables.
This means several things.
1. Using direct (1-of-n) encoding or contrast encoding (1-of-(n-1)) makes
no difference regarding the underdetermined nature of the problem.
2. You have to use some method for dealing with underdetermined systems
to handle the too-many-variables, too-little-data problem, and variable
selection isn't going to work.
3. You have to build in a solution for collinearity as well.
The answer here is to use some kind of regularization. For logistic
regression, I strongly recommend that you try out L1 (the lasso)
or a combination of L1 and L2 (elastic net) regularization.
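As an illustration only (this is not the glmnet or Mahout implementation),
here is a bare-bones SGD update for logistic regression with an
elastic-net-style penalty, in pure Python with made-up data. Real solvers
use much more careful step-size schedules and handle the non-smooth L1
term properly (e.g. via coordinate descent or truncated updates).

```python
import math
import random

def sgd_logistic(data, n_features, l1=0.01, l2=0.01, lr=0.1, epochs=200):
    """Toy SGD for logistic regression with L1 + L2 regularization.
    `data` is a list of (feature_vector, label) pairs, labels in {0, 1}.
    The L1 term uses a naive subgradient, which is enough to illustrate
    the idea but not how production solvers do it."""
    w = [0.0] * n_features
    b = 0.0
    rng = random.Random(42)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for i in range(n_features):
                sign = 1 if w[i] > 0 else -1 if w[i] < 0 else 0
                w[i] -= lr * (err * x[i] + l2 * w[i] + l1 * sign)
            b -= lr * err  # intercept is conventionally left unregularized
    return w, b

# Made-up example: the label depends only on the first feature, so the
# penalty should keep the irrelevant weights small.
data = [([1, 0, 0], 1), ([1, 0, 0], 1), ([0, 1, 0], 0),
        ([0, 0, 1], 0), ([0, 1, 0], 0), ([1, 0, 0], 1)]
w, b = sgd_logistic(data, 3)
```

The key point is that the penalty terms give the otherwise underdetermined
problem a unique, bounded solution instead of letting the coefficients
wander.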
In R, the best library I have found for this is glmnet. One particular
benefit of glmnet is that it handles sparse matrices well.
The SGD implementation in Mahout also supports L1 or L1+L2 regularization
quite easily. I wouldn't call that implementation state of the art, but it
may do the job for you. If your problem will fit into glmnet, that is a
great option. If it is too large for R, consider H2O's solvers.
On Mon, Sep 22, 2014 at 11:43 AM, Aymen J <ay.j@hotmail.fr> wrote:
> Hi List,
> I'm using Mahout Logistic Regression for a prediction task. As a test, I
> try the classification task with one single feature, a categorical one with
> 26 levels.
> When I run the Logistic regression on R or Python, I expect 25
> coefficients (corresponding to 25 out of the 26 levels, due to the
> "contrast coding") + the intercept. However, when I run it on Mahout, I
> have 26 coefficients + the intercept. Is there any way to force the
> contrast coding on Mahout (i.e. consider one of the level as the default
> level)? Isn't there a risk of matrix singularity by considering the 26
> levels in the logistic regression?
> Let me know if it's not clear. Thanks in advance for your answers,
> Aymen
