mahout-dev mailing list archives

From: Ted Dunning <ted.dunn...@gmail.com>
Subject: Re: [jira] [Commented] (MAHOUT-668) Adding knn support to Mahout classifiers
Date: Sun, 22 May 2011 04:47:21 GMT
On Sat, May 21, 2011 at 5:47 PM, Daniel McEnnis (JIRA) <jira@apache.org> wrote:

> 1. Use case: This is the algorithm for those learning problems that are
> simply too massive even for Mahout's memory-streamlined algorithms.
> Particularly for knn, it's the advertising company with 50,000 classes of
> people, tens to hundreds of millions of examples, and many terabytes of log
> data to classify which type of person a log belongs to.  Memory footprint
> becomes the biggest issue, as even the model takes more memory than is
> available.  For the other Mahout classifiers, training data size is limited
> to available memory on data nodes.
>

Actually, no.  This is not true for any of the other model-training
algorithms in Mahout, except arguably (but not really) for the random
forest.  For the Naive Bayes algorithms and the SGD algorithms it is
distinctly not true: both train from streamed data without ever holding the
full training set in memory.


> 3.  These distance measures have very different assumptions from those in
> recommendation. A missing vector entry (say in sparse vector format) means
> 0, not missing.  This requires a hack of all distance measures to
> accommodate it.
>

I don't see why.  Most of the other distance measures in Mahout use this
same convention.  Certainly v1.getDistanceSquared(v2) and
v1.minus(v2).assign(Functions.ABS).zSum() would give you results that assume
0's for missing elements.

I really think that the sub-classes
of org.apache.mahout.common.distance.DistanceMeasure already do just what
you say you want.
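
For instance, a minimal sketch (assuming the 0.5-era mahout-math and
mahout-core APIs) showing that unset entries of a sparse vector are read
as 0 by the existing measures:

    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class SparseZeroDemo {
      public static void main(String[] args) {
        DistanceMeasure measure = new EuclideanDistanceMeasure();
        Vector v1 = new RandomAccessSparseVector(10);
        Vector v2 = new RandomAccessSparseVector(10);
        v1.set(3, 2.0);           // v2 never sets index 3; it reads as 0
        System.out.println(measure.distance(v1, v2));   // prints 2.0
      }
    }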

> The measures also range from 0 to infinity, not -1 to 1, and smaller is
> better.  Cosine distance doesn't fit this, so it has a transform to map it
> to 0-2, where smaller is better.
>

My point was that cosine distance is essentially the same as Euclidean
distance (for unit-length vectors they differ only by a monotone
transform).  Why not just use that?
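
Concretely, a sketch of that equivalence, again assuming the mahout-math
API: for unit-length u and v, ||u - v||^2 = 2 - 2 * cos(u, v), so the 0-2
cosine transform is just half the squared Euclidean distance.

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class CosineVsEuclidean {
      public static void main(String[] args) {
        Vector u = new DenseVector(new double[] {1, 2, 3}).normalize();
        Vector v = new DenseVector(new double[] {3, 1, 0}).normalize();
        double euclideanSquared = u.getDistanceSquared(v);
        double cosineDistance = 1 - u.dot(v);   // the 0-2 transform above
        // The two values agree, so ranking by one ranks identically
        // by the other.
        System.out.println(euclideanSquared + " == " + 2 * cosineDistance);
      }
    }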



> KL Distance is based on entropy.  I'll double check my references for the
> details.
>

I am pretty sure that you are looking at Kullback-Leibler divergence.  I
think you just need to put in a Wikipedia reference.  Your javadoc is not
quite correct in any case.
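
For reference, a minimal sketch of the divergence in question (the helper
below is illustrative, not Mahout API), assuming discrete distributions
with matching support.  Note that it is asymmetric, which is why it is a
divergence rather than a distance:

    public class KlDemo {
      // D_KL(P || Q) = sum_i p[i] * log(p[i] / q[i]),
      // with the 0 * log(0) terms taken as 0.
      static double klDivergence(double[] p, double[] q) {
        double d = 0;
        for (int i = 0; i < p.length; i++) {
          if (p[i] > 0) {
            d += p[i] * Math.log(p[i] / q[i]);
          }
        }
        return d;
      }

      public static void main(String[] args) {
        System.out.println(klDivergence(new double[] {0.5, 0.5},
                                        new double[] {0.9, 0.1}));
      }
    }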


> 5. standard classifier - Until today, I thought this was specific to the
> Bayes algorithm.  I'll add it to the next patch.
>

Look at org.apache.mahout.classifier.AbstractVectorClassifier
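
A rough sketch of how that API gets used in-process, using the SGD
logistic regression as a stand-in concrete subclass (any classifier
extending AbstractVectorClassifier exposes the same scoring methods):

    import org.apache.mahout.classifier.AbstractVectorClassifier;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class ClassifierApiDemo {
      public static void main(String[] args) {
        // Scores a vector entirely in memory, with no map-reduce pass;
        // shown here with an untrained model for brevity.
        AbstractVectorClassifier model =
            new OnlineLogisticRegression(20, 1000, new L1());
        Vector instance = new DenseVector(1000);
        Vector scores = model.classifyFull(instance); // one score per class
        System.out.println("best category: " + scores.maxValueIndex());
      }
    }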


> 6. usability.  Any user reading the javadoc on the entry classes
> ModelBuilder, Classifier, or TestClassifier has instructions on how to
> set up data for this patch.  All three should have their options explained.
>

That isn't what I meant.  Command-line documentation is all well and good,
but there should be a usable API as well, especially for deployment in a
working system.  Very few systems can afford to do an entire map-reduce pass
when they just want to classify a few data points.


> I'll add it to the list of things to put in the next patch.  My
> understanding was that there is no standard, at least for input formats, in
> Mahout.  This patch describes my proposal for what input formats each Mahout
> component ought to be able to process.
>

If you are pushing for a standard, then it should be independent of your
classifier, and you should explain how it interacts with, say, the hashed
vector-encoding framework.  See
org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder.
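
For example, a sketch of hashed encoding (assuming the 0.5-era vectorizer
classes; the "user-agent" field name is hypothetical): features are hashed
directly into a fixed-width vector, so no shared dictionary or input-format
standard is needed.

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedEncodingDemo {
      public static void main(String[] args) {
        // Hash a categorical feature straight into a 1000-wide vector.
        FeatureVectorEncoder encoder =
            new StaticWordValueEncoder("user-agent");
        Vector v = new RandomAccessSparseVector(1000);
        encoder.addToVector("Mozilla/5.0", v);
        System.out.println(v.getNumNondefaultElements() + " cells set");
      }
    }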
