mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: OnlineLogisticRegression: Are my settings sensible
Date Fri, 08 Nov 2013 04:32:17 GMT
Why is FEATURE_NUMBER != 13?

With 12 features that are already lovely and continuous, just stick them in
elements 1..12 of a 13-element vector and put a constant value at the
beginning of it (element 0).  Hashed encoding is good for sparse data, but it
is confusing for your case.
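
A minimal sketch of that layout, written in plain Java so it runs without Mahout on the classpath. The class and method names (DenseLayout, toVector) are made up for illustration, and the feature values in main are placeholders, not data from the thread.

```java
import java.util.Arrays;

/** Sketch: 12 continuous features in a dense 13-element vector, constant first. */
final class DenseLayout {
    static final int NUM_FEATURES = 12;               // feature count from the thread
    static final int VECTOR_SIZE = NUM_FEATURES + 1;  // 13: one extra slot for the constant

    /** Returns {1.0, f1, ..., f12}. The name toVector is illustrative. */
    static double[] toVector(double[] features) {
        if (features.length != NUM_FEATURES) {
            throw new IllegalArgumentException(
                    "expected " + NUM_FEATURES + " features, got " + features.length);
        }
        double[] v = new double[VECTOR_SIZE];
        v[0] = 1.0;                                        // constant (intercept) term
        System.arraycopy(features, 0, v, 1, NUM_FEATURES); // features go in elements 1..12
        return v;
    }

    public static void main(String[] args) {
        double[] features = new double[NUM_FEATURES];
        for (int i = 0; i < NUM_FEATURES; i++) {
            features[i] = 0.1 * (i + 1); // placeholder standardized values
        }
        System.out.println(Arrays.toString(toVector(features)));
    }
}
```

With Mahout, the resulting array could presumably be wrapped with new DenseVector(values), and the feature count passed to the OnlineLogisticRegression constructor would then be 13 rather than 100.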

Also, it looks like you only pass through the (very small) training set
once.  The OnlineLogisticRegression is unlikely to converge very well with
such a small number of examples.
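
One way to get more than a single pass is to buffer the samples and replay them several times in shuffled order. The sketch below assumes the samples can be held in memory (plausible for a few thousand). The Learner interface is a hypothetical stand-in for a call such as olr.train(target, vector), and the number of passes is a tuning choice that this thread does not specify.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch: replay a small buffered training set for several shuffled passes. */
final class MultiPassTraining {
    /** Hypothetical stand-in for an online learner's train(target, vector) call. */
    interface Learner {
        void train(int target, double[] features);
    }

    /** One labeled example. The field names are illustrative. */
    static final class Example {
        final int target;
        final double[] features;

        Example(int target, double[] features) {
            this.target = target;
            this.features = features;
        }
    }

    /** Feeds every buffered example to the learner once per pass; returns the update count. */
    static int trainForPasses(List<Example> buffered, int passes, long seed, Learner learner) {
        List<Example> order = new ArrayList<>(buffered); // copy so the caller's list is untouched
        Random rng = new Random(seed);
        int updates = 0;
        for (int pass = 0; pass < passes; pass++) {
            Collections.shuffle(order, rng);             // new presentation order each pass
            for (Example e : order) {
                learner.train(e.target, e.features);
                updates++;
            }
        }
        return updates;
    }

    public static void main(String[] args) {
        List<Example> data = new ArrayList<>();
        data.add(new Example(1, new double[] {1.0, 0.5}));
        data.add(new Example(0, new double[] {1.0, -0.3}));
        int updates = trainForPasses(data, 10, 42L, (target, x) -> { /* learner update goes here */ });
        System.out.println("updates performed: " + updates);
    }
}
```

Shuffling between passes keeps the learner from seeing the examples in the same order every time. Whether that matters here is an assumption worth checking against held-out data.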

Finally, in the hashed representation that you are using, you use exactly
the same ContinuousValueEncoder (CVE) to put all 15 (12?) of the variables
into the vector.  Since you are using the same CVE, all of these values will
be put into exactly the same location, which is going to kill performance
because you get the effect of summing all your variables together.
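
The collision can be illustrated with a toy hashed encoder in plain Java. This is not Mahout's hashing. The slot function below uses String.hashCode purely as an assumption for illustration, and it shares only one property with the situation described above: the slot depends on the encoder name, not on which variable is being added.

```java
/** Sketch: why one shared encoder name sums every variable into the same slot. */
final class HashedCollisionDemo {
    /** Toy slot function: depends only on the encoder name (not Mahout's real hash). */
    static int slot(String encoderName, int size) {
        return Math.floorMod(encoderName.hashCode(), size);
    }

    /** Adds values[i] at the slot chosen by encoderNames[i]. */
    static double[] encode(String[] encoderNames, double[] values, int size) {
        double[] v = new double[size];
        for (int i = 0; i < values.length; i++) {
            v[slot(encoderNames[i], size)] += values[i];
        }
        return v;
    }

    /** Counts the slots that ended up holding a non-zero value. */
    static int nonZeroCount(double[] v) {
        int n = 0;
        for (double x : v) {
            if (x != 0.0) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0};
        // One shared name: every value lands in the same slot and is summed.
        double[] shared = encode(new String[] {"cve", "cve", "cve"}, values, 100);
        // One name per variable: the values are spread over different slots.
        double[] perFeature = encode(
                new String[] {"feature1", "feature2", "feature3"}, values, 100);
        System.out.println("shared name, non-zero slots: " + nonZeroCount(shared));
        System.out.println("per-feature names, non-zero slots: " + nonZeroCount(perFeature));
    }
}
```

Giving each variable its own encoder name avoids the systematic collision, although hashed encodings can still collide occasionally by chance. For 12 dense features, the plain 13-element vector avoids the issue entirely.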

On Thu, Nov 7, 2013 at 1:48 PM, Andreas Bauer <buki@gmx.net> wrote:

> Hi,
>
> I’m trying to use OnlineLogisticRegression for a two-class classification
> problem, but as my classification results are not very good, I wanted to
> ask for support to find out if my settings are correct and if I’m using
> Mahout correctly. Because if I’m doing it correctly then probably my
> features are crap...
>
> In total I have 12 features. All are continuous values and all are
> normalized/standardized (this has no effect on the classification
> performance at the moment).
>
> Training samples keep flowing in at a constant rate (i.e. incremental
> training), but in total it won’t be more than a few thousand (class split
> pos/negative 30:70).
>
> My performance measures are not very good, e.g. with approx. 3600
> training samples I get
>
> f-measure(beta=0.5): 0.38
> precision: 0.33
> recall: 0.47
>
> The parameters I use are
>
> lambda=0.0001
> offset=1000
> alpha=1
> decay_exponent=0.9
> learning_rate=50
>
>
> FEATURE_NUMBER = 100;
> CATEGORIES_NUMBER = 2;
>
>
>
> Java code snip:
>
> private OnlineLogisticRegression olr;
> private ContinuousValueEncoder continousValueEncoder;
>
> private static final FeatureVectorEncoder BIAS =
>         new ConstantValueEncoder("Intercept");
>
> …
> public Training() {
>     // L2 or ElasticBandPrior do not affect the performance
>     olr = new OnlineLogisticRegression(CATEGORIES_NUMBER, FEATURE_NUMBER, new L1());
>     olr.lambda(lambda).learningRate(learning_rate).stepOffset(offset).decayExponent(decay_exponent);
>     this.continousValueEncoder = new ContinuousValueEncoder("ContinuousValueEncoder");
>     this.continousValueEncoder.setProbes(20);
>     ….
> }
>
>
> public void train(TrainingSample sample, int target) {
>     DenseVector denseVector = new DenseVector(FEATURE_NUMBER);
>     // sample.getFeatureValue1-15() return a double value
>     this.continousValueEncoder.addToVector((byte[]) null, sample.getFeatureValue1(), denseVector);
>     ….
>     this.continousValueEncoder.addToVector((byte[]) null, sample.getFeatureValue15(), denseVector);
>     BIAS.addToVector((byte[]) null, 1, denseVector);
>     olr.train(target, denseVector);
> }
>
> It is also interesting to note that when I use the model, both test and
> classification always yield probabilities of 1.0 or 0.99xxx for either
> class.
>
> result = this.olr.classifyFull(input);
> LOGGER.debug("TrainingSink test: classify real category:"
> + realCategory + " olr classifier result: "
> + result.maxValueIndex() + " prob: " + result.maxValue());
>
>
>
>
> Would be great if you could give me some advice.
>
> Many thanks,
>
> Andreas
>
>
>
