mahout-user mailing list archives

From Ted Dunning
Subject Re: hashing trick in SGD classifier
Date Sat, 12 Mar 2011 00:15:54 GMT
OLR will work.  The dense parameter matrix is sized after hashing, so its
size depends on the hashed vector size rather than on the raw key space.
Typically this is 10^4 to 10^6 by (k-1), where k is the number of
categories.  With 2 or 3 probes, a hashed size of 10^5 has worked very well
for me with fairly enormous key spaces.  With one probe, I have heard that
2-3 x 10^6 elements are required to get good performance.
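
To make the probe idea concrete, here is a minimal sketch in plain Java.
It is not Mahout's encoder code: the class name HashedEncoder, the
FNV-1a-style hash, and the per-probe seeding are all assumptions chosen
only for illustration.  Each token is hashed once per probe into a vector
of fixed size, so the vector length is set up front no matter how many
distinct tokens show up.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical illustration of multi-probe feature hashing (not Mahout code).
class HashedEncoder {
    // FNV-1a style hash, seeded differently for each probe (an assumption;
    // any reasonably well-mixed seeded hash would serve for this sketch).
    static int hash(String token, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // One slot per probe, each in the range [0, numFeatures).
    static int[] probeIndexes(String token, int numFeatures, int probes) {
        int[] out = new int[probes];
        for (int p = 0; p < probes; p++) {
            out[p] = Math.floorMod(hash(token, p), numFeatures);
        }
        return out;
    }

    // Adds 1.0 at every probe location of every token; collisions simply sum.
    static double[] encode(List<String> tokens, int numFeatures, int probes) {
        double[] v = new double[numFeatures];
        for (String t : tokens) {
            for (int idx : probeIndexes(t, numFeatures, probes)) {
                v[idx] += 1.0;
            }
        }
        return v;
    }
}
```

With more than one probe, two tokens that collide in one slot are unlikely
to collide in all of their slots, which is one way to read why a smaller
vector can be enough when 2 or 3 probes are used.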

Take a look at org.apache.mahout.vectorizer.encoders.

Note that having sparse feature vectors is not incompatible with having a
dense model parameter matrix.
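
As a rough illustration of that point, the sketch below keeps the
parameters in a dense (k-1) by numFeatures array while the input is held
as parallel arrays of hashed indexes and values.  The class and method
names are hypothetical and this is not how Mahout's classes are written;
it only shows that scoring and a gradient step touch just the nonzero
columns.

```java
// Hypothetical sketch: sparse input against a dense parameter matrix.
class SparseDenseSketch {
    // beta has (numCategories - 1) rows and numFeatures columns.
    // idx/val list only the nonzero entries of the feature vector.
    static double[] scores(double[][] beta, int[] idx, double[] val) {
        double[] s = new double[beta.length];
        for (int r = 0; r < beta.length; r++) {
            for (int j = 0; j < idx.length; j++) {
                s[r] += beta[r][idx[j]] * val[j];
            }
        }
        return s;
    }

    // One gradient step; gradient[r] is the per-category error term.
    // Only the columns named in idx are modified.
    static void update(double[][] beta, int[] idx, double[] val,
                       double[] gradient, double rate) {
        for (int r = 0; r < beta.length; r++) {
            for (int j = 0; j < idx.length; j++) {
                beta[r][idx[j]] += rate * gradient[r] * val[j];
            }
        }
    }
}
```

The cost per training example scales with the number of nonzero features
times (k-1), not with the full width of the matrix.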

While you are at it, take note of
 which uses multiple
 's to optimize learning rates.  This lets you amortize the input and
encoding costs across many learners.  This also allows you to not waste
adaptation effort on loser values of the hyper-parameters.  As a side
benefit, it also decreases the dimension of the hyper-parameters so that the
adaptation goes faster as well.
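
A much simplified sketch of the amortization idea, in plain Java: each
example is encoded once and then handed to several learners that differ
only in learning rate, and each learner scores the example before training
on it so the learners can be compared.  Everything here (RatePoolSketch,
Learner, the binary log loss, the fixed rates) is an assumption for
illustration; it leaves out any adaptation of the rates and is not the
Mahout implementation.

```java
// Hypothetical sketch: one encoded example shared by many learners.
class RatePoolSketch {
    // A binary logistic learner with its own fixed learning rate.
    static class Learner {
        final double rate;
        final double[] w;
        double loss;  // running log loss, measured before each update

        Learner(double rate, int numFeatures) {
            this.rate = rate;
            this.w = new double[numFeatures];
        }

        void train(int[] idx, double[] val, int label) {
            double z = 0;
            for (int j = 0; j < idx.length; j++) {
                z += w[idx[j]] * val[j];
            }
            double p = 1.0 / (1.0 + Math.exp(-z));
            loss += -Math.log(label == 1 ? p : 1.0 - p);
            for (int j = 0; j < idx.length; j++) {
                w[idx[j]] += rate * (label - p) * val[j];
            }
        }
    }

    // The already-encoded example is reused by every learner in the pool.
    static void trainAll(Learner[] pool, int[] idx, double[] val, int label) {
        for (Learner l : pool) {
            l.train(idx, val, label);
        }
    }

    // Learner with the lowest accumulated loss so far.
    static Learner best(Learner[] pool) {
        Learner b = pool[0];
        for (Learner l : pool) {
            if (l.loss < b.loss) {
                b = l;
            }
        }
        return b;
    }
}
```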

I hate to self-promote, but there is a lot of material on this in the last
section of the MiA book.

On Fri, Mar 11, 2011 at 2:48 PM, Segal, Alec R. wrote:

> Hi,
> What is the best way to implement the feature+classes hashing trick in the
> SGD classifier (similar to "Hash kernels" or "Feature hashing for large
> Scale Multitask Learning" by John Langford)?
> OnlineLogisticRegression uses the dense parameter matrix - it would not
> work for me:
> beta = new DenseMatrix(numCategories - 1, numFeatures);
> I have a large scale text classification problem (large number of features
> and classes) - and want to use ngram features. What would be the best way to
> do (online) text encoding /training in mahout?
> I've seen some postings about hashing trick being implemented in LDA in
> Mahout - could it be reused?
> In one of the scenarios the number of classes and features are not known in
> advance and could grow with the data size.
> Thank you,
> Alec Segal
