mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: hashing trick in SGD classifier
Date Sat, 12 Mar 2011 00:15:54 GMT
OLR (OnlineLogisticRegression) will work.  The dense parameter matrix is
sized after hashing.  Typically this is 10^4 to 10^6 by (k-1), where k is
the number of categories.  With 2 or 3 probes, 10^5 has worked very well
for me with fairly enormous key spaces.  With one probe, I have heard that
2-3 x 10^6 elements are required to get good performance.

Take a look at org.apache.mahout.vectorizer.encoders<https://builds.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/encoders/package-summary.html>

Note that having sparse feature vectors is not incompatible with having a
dense model parameter matrix.
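
Concretely, a minimal sketch of that combination (the class name, corpus
and label here are made up, and this assumes the current trunk APIs):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedOlrSketch {
  public static void main(String[] args) {
    int numFeatures = 100000;    // 10^5 hashed slots, per the numbers above
    int numCategories = 20;      // made-up number of classes

    StaticWordValueEncoder encoder = new StaticWordValueEncoder("words");
    encoder.setProbes(2);        // 2 probes per feature, per the numbers above

    // The model's dense matrix is (numCategories - 1) x numFeatures, i.e.
    // sized after hashing, however large the raw key space is.
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(numCategories, numFeatures, new L1());

    // One made-up training example: sparse input vector, dense model.
    Vector v = new RandomAccessSparseVector(numFeatures);
    for (String token : new String[] {"the", "quick", "brown", "fox"}) {
      encoder.addToVector(token, v);   // each token hashes into 2 slots of v
    }
    learner.train(3, v);               // 3 = made-up true category
  }
}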

While you are at it, take note of
AdaptiveLogisticRegression<https://builds.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/classifier/sgd/AdaptiveLogisticRegression.html>,
which uses multiple
CrossFoldLearner<https://builds.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/classifier/sgd/CrossFoldLearner.html>s
to optimize learning rates.  This lets you amortize the input and
encoding costs across many learners.  It also avoids wasting adaptation
effort on losing values of the hyper-parameters.  On the side, it
decreases the dimension of the hyper-parameter space so that the
adaptation goes faster as well.
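
Roughly like this (again a made-up sketch, not tested code; the
LabeledVector holder is hypothetical and the vectors are assumed to be
hash-encoded as above):

import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.CrossFoldLearner;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.math.Vector;

public class AdaptiveSketch {
  // Hypothetical holder for one already-encoded, labeled example.
  static class LabeledVector {
    final int label;
    final Vector vector;
    LabeledVector(int label, Vector vector) {
      this.label = label;
      this.vector = vector;
    }
  }

  static CrossFoldLearner trainAdaptive(Iterable<LabeledVector> examples,
                                        int numCategories, int numFeatures) {
    AdaptiveLogisticRegression learner =
        new AdaptiveLogisticRegression(numCategories, numFeatures, new L1());
    for (LabeledVector example : examples) {
      // Each call fans the encoded vector out to several CrossFoldLearners
      // running with different hyper-parameter settings.
      learner.train(example.label, example.vector);
    }
    learner.close();
    // Returns the learner whose hyper-parameters won the adaptive search.
    // (getBest() stays null until enough data has been seen to evaluate.)
    return learner.getBest().getPayload().getLearner();
  }
}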

I hate to self-promote, but there is a lot of material on this in the last
section of the Mahout in Action (MiA) book.

On Fri, Mar 11, 2011 at 2:48 PM, Segal, Alec R. <Alec.R.Segal@supermedia.com> wrote:

> Hi,
>
> What is the best way to implement the feature+classes hashing trick in the
> SGD classifier (similar to "Hash Kernels" or "Feature Hashing for Large
> Scale Multitask Learning" by John Langford)?
>
> OnlineLogisticRegression uses a dense parameter matrix - it would not
> work for me:
>
> beta = new DenseMatrix(numCategories - 1, numFeatures);
>
> I have a large-scale text classification problem (a large number of features
> and classes) and want to use ngram features. What would be the best way to
> do (online) text encoding/training in Mahout?
> I've seen some postings about the hashing trick being implemented in LDA in
> Mahout - could it be reused?
> In one of the scenarios, the number of classes and features is not known in
> advance and could grow with the data size.
>
> Thank you,
> Alec Segal
>
>
