mahout-dev mailing list archives

From Robin Anil <robin.a...@gmail.com>
Subject Vowpal Wabbit 5.0 with optimisations
Date Thu, 23 Dec 2010 18:46:24 GMT
Worth a look; maybe some new ideas for the sgd package.

Robin

http://hunch.net/?p=1594

Vowpal Wabbit, version 5.0, and the second heresy <http://hunch.net/?p=1594>
Tags: Announcements <http://hunch.net/?cat=4>, Code <http://hunch.net/?cat=42>,
Machine Learning <http://hunch.net/?cat=29> — jl <http://hunch.net/~jl> @ 9:52 pm

I’ve released version 5.0 <http://hunch.net/~vw/vowpal_wabbit_v5.0.tar.gz> of
the Vowpal Wabbit <http://hunch.net/~jl> online learning software. The major
number has changed since the last release <http://hunch.net/?p=1068> because
I regard all earlier versions as obsolete: there are several new algorithms
and features, including substantial changes and upgrades to the default
learning algorithm.

The biggest changes are new algorithms:

   1. Nikos <http://www.cs.cornell.edu/~nk/> and I improved the default
   algorithm. The basic update rule still uses gradient descent, but the size
   of the update is carefully controlled so that it’s impossible to overrun the
   label. In addition, the normalization has changed. Computationally, these
   changes are virtually free and yield better results, sometimes much better.
   Less careful updates can be re-enabled with --loss_function classic, although
   results are still not identical to previous versions due to the
   normalization changes.
   2. Nikos also implemented per-feature learning rates as per these two
   papers: <http://www.colt2010.org/papers/104mcmahan.pdf> and
   <http://www.colt2010.org/papers/023Duchi.pdf>. Often, this works better
   than the default algorithm. It isn’t the default because it isn’t (yet) as
   adaptable in terms of learning rate decay. This is enabled with --adaptive,
   and learned regressors are compatible with the default. Computationally,
   you might see a factor of 4 slowdown if using ‘-q’. Nikos noticed that the
   phenomenal Quake inverse square root hack
   <http://en.wikipedia.org/wiki/Fast_inverse_square_root> applies, making
   this substantially faster than a naive implementation.
   3. Nikos and Daniel <http://cseweb.ucsd.edu/~djhsu/> also implemented
   active learning derived from this paper
   <http://cseweb.ucsd.edu/~djhsu/papers/iwal-cal.pdf>, usable via
   --active_simulation (to test parameters on an existing supervised dataset)
   or --active_learning (to do the real thing). This runs at full speed, which
   is much faster than is reasonable in any active learning scenario. We see
   this approach dominating supervised learning on all classification datasets
   so far, often with far fewer labeled examples required, as the theory
   predicts. The learned predictor is compatible with the default.
   4. Olivier <http://research.yahoo.com/Olivier_Chapelle> helped me implement
   preconditioned conjugate gradient based on Jonathan Shewchuk’s
   <http://www.cs.berkeley.edu/~jrs/> tutorial
   <http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf>.
   This is a *batch* algorithm and hence requires multiple passes over any
   dataset to do something useful. Each step of conjugate gradient requires 2
   passes. The advantage of cg is that it converges relatively quickly via the
   use of second derivative information. This can be particularly helpful if
   your features are of widely differing scales. The use of --regularization
   0.001 (or smaller) is almost required with --conjugate_gradient, as it will
   otherwise overfit hard. This implementation has two advantages over the
   basic approach: it implicitly computes a Hessian in *O(n)* time, where *n*
   is the number of features, and it operates out of core, making it
   applicable to datasets that don’t conveniently fit in RAM. The learned
   predictor is compatible with the default, although you’ll notice that a
   factor of 8 more RAM is required when learning.
   5. Matt Hoffman <http://www.cs.princeton.edu/~mdhoffma/> and I implemented
   Online Latent Dirichlet Allocation
   <http://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf>. This
   code is still experimental and likely to change over the next week. It
   really does a minibatch update under the hood. The code appears to be
   substantially faster than Matt’s earlier Python implementation, making this
   probably the most efficient LDA anywhere. LDA is still much slower than
   online linear learning, as it is quite computationally heavy in comparison;
   perhaps a good candidate for GPU optimization.
   6. Nikos, Daniel, and I have been experimenting with more online cluster
   parallel learning algorithms (--corrective, --backprop, --delayed_global).
   We aren’t yet satisfied with these, although they are improving. Details
   are at the LCCC workshop <http://lccc.eecs.berkeley.edu/>.

In addition, Ariel <http://www.yendor.com/> added a test suite, Shravan
<http://research.yahoo.com/Shravan_Narayanamurthy> helped with ngrams, and
there are several other minor new features and bug fixes, including a very
subtle one caught by Vaclav <http://www.occamslab.com/petricek/>.

The documentation on the website hasn’t kept up with the code. I’m planning
to rectify that over the next week, and have a new tutorial starting at 2pm
in the LCCC <http://lccc.eecs.berkeley.edu/schedule.html> room for those
interested. Yes, I’ll not be skiing :)
