mahout-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject [CONF] Apache Mahout > Logistic Regression
Date Fri, 04 Nov 2011 13:52:01 GMT
Space: Apache Mahout (
Page: Logistic Regression (

Edited by Grant Ingersoll:
h1. Logistic Regression (SGD)

Logistic regression is a model used for prediction of the probability of occurrence of an
event. It makes use of several predictor variables that may be either numerical or categories.

Logistic regression is the standard industry workhorse that underlies many production fraud
detection and advertising quality and targeting products.  The Mahout implementation uses
Stochastic Gradient Descent (SGD) to all large training sets to be used.

For a more detailed analysis of the approach, have a look at the thesis of Paul Komarek:

See MAHOUT-228 for the main JIRA issue for SGD.

h2. Parallelization strategy

The bad news is that SGD is an inherently sequential algorithm.  The good news is that it
is blazingly fast and thus it is not a problem for Mahout's implementation to handle training
sets of tens of millions of examples.  With the down-sampling typical in many data-sets, this
is equivalent to a dataset with billions of raw training examples.

The SGD system in Mahout is an online learning algorithm which means that you can learn models
in an incremental fashion and that you can do performance testing as your system runs.  Often
this means that you can stop training when a model reaches a target level of performance.
 The SGD framework includes classes to do on-line evaluation using cross validation (the CrossFoldLearner)
and an evolutionary system to do learning hyper-parameter optimization on the fly (the AdaptiveLogisticRegression).
 The AdaptiveLogisticRegression system makes heavy use of threads to increase machine utilization.
 The way it works is that it runs 20 CrossFoldLearners in separate threads, each with slightly
different learning parameters.  As better settings are found, these new settings are propagating
to the other learners.

h2. Design of packages

There are three packages that are used in Mahout's SGD system.  These include

* The vector encoding package (found in org.apache.mahout.vectorizer.encoders)

* The SGD learning package (found in org.apache.mahout.classifier.sgd)

* The evolutionary optimization system (found in org.apache.mahout.ep)

h3. Feature vector encoding

Because the SGD algorithms need to have fixed length feature vectors and because it is a pain
to build a dictionary ahead of time, most SGD applications use the hashed feature vector encoding
system that is rooted at FeatureVectorEncoder.

The basic idea is that you create a vector, typically a RandomAccessSparseVector, and then
you use various feature encoders to progressively add features to that vector.  The size of
the vector should be large enough to avoid feature collisions as features are hashed.

There are specialized encoders for a variety of data types.  You can normally encode either
a string representation of the value you want to encode or you can encode a byte level representation
to avoid string conversion.  In the case of ContinuousValueEncoder and ConstantValueEncoder,
it is also possible to encode a null value and pass the real value in as a weight.  This avoids
numerical parsing entirely in case you are getting your training data from a system like Avro.

Here is a class diagram for the encoders package:


h3. SGD Learning

For the simplest applications, you can construct an OnlineLogisticRegression and be off and
running.  Typically, though, it is nice to have running estimates of performance on held out
data.  To do that, you should use a CrossFoldLearner which keeps a stable of five (by default)
OnlineLogisticRegression objects.  Each time you pass a training example to a CrossFoldLearner,
it passes this example to all but one of its children as training and passes the example to
the last child to evaluate current performance.  The children are used for evaluation in a
round-robin fashion so, if you are using the default 5 way split, all of the children get
80% of the training data for training and get 20% of the data for evaluation.

To avoid the pesky need to configure learning rates, regularization parameters and annealing
schedules, you can use the AdaptiveLogisticRegression.  This class maintains a pool of CrossFoldLearners
and adapts learning rates and regularization on the fly so that you don't have to.

Here is a class diagram for the classifiers.sgd package.  As you can see, the number of twiddlable
knobs is pretty large.  For some examples, see the TrainNewsGroups example code.


Change your notification preferences:

View raw message