mahout-dev mailing list archives

From Ted Dunning
Subject classifier architecture needed
Date Mon, 21 Jun 2010 18:12:47 GMT
We are now beginning to have lots of classifiers in Mahout.  The naive
Bayes, complementary naive Bayes and random forest grandfathers have been
joined by my recent SGD and Zhao Zhendong's prolific set of approaches for
logistic regression and SVM variants.

All of these implementations have similar characteristics and virtually none
are inter-operable.

Even worse, the model produced by a clustering system is really just like a
model produced by a classifier, so the clustering code adds even more sources
of incompatible classifiers.  Altogether, we probably have a dozen
ways of building classifiers.

I would like to start a discussion about a framework into which we can fit
all of these approaches, in much the same way that the recommendations
stuff has such nice pluggable properties.

As I see it, the opportunities for commonality (aka our current
deficiencies)  include:

- original input format reading

-- the naive Bayes code uses an ad hoc format similar to what Jason Rennie
used for 20 Newsgroups.  This code uses Lucene 3.0-style analyzers.

-- Zhao uses something a lot like SVMLight input format

-- The SGD code looks at CSV data

-- Drew wrote some Avro document code

-- Lucene has been used as a source of vectors for clustering

My summary here is that the Lucene analyzers look like they could be used
very effectively for our purposes.  We would need to write AttributeFilters
that do two kinds of vectorization (random projection and dictionary-based).
We should also have 4 standard input format parsers as examples (CSV,
SVMLight, VowpalWabbit, current naive Bayes format).

We need something simple and general that subsumes all of these input use
cases.
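
To make the parser idea concrete, here is a minimal sketch of how two of the
example formats could sit behind one interface.  All of the names
(LabeledRecord, RecordParser, CsvRecordParser, SvmLightRecordParser) are
hypothetical and are not existing Mahout classes.  The CSV handling is
deliberately simplified (no quoting, label assumed to be the last column),
and the Lucene analyzer step is left out entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical common record: a target label plus named raw feature values. */
final class LabeledRecord {
    final String label;
    final Map<String, String> features;

    LabeledRecord(String label, Map<String, String> features) {
        this.label = label;
        this.features = features;
    }
}

/** Hypothetical parser contract: one line of text in, one record out. */
interface RecordParser {
    LabeledRecord parse(String line);
}

/** Simplified CSV: no quoting or escaping; the last column is assumed to be the label. */
final class CsvRecordParser implements RecordParser {
    private final String[] header;

    CsvRecordParser(String... header) {
        this.header = header;
    }

    @Override
    public LabeledRecord parse(String line) {
        String[] cells = line.split(",", -1);
        if (cells.length != header.length) {
            throw new IllegalArgumentException("expected " + header.length + " columns: " + line);
        }
        Map<String, String> features = new LinkedHashMap<>();
        for (int i = 0; i < cells.length - 1; i++) {
            features.put(header[i], cells[i].trim());
        }
        return new LabeledRecord(cells[cells.length - 1].trim(), features);
    }
}

/** SVMLight-style line: "label index:value ... # optional comment". */
final class SvmLightRecordParser implements RecordParser {
    @Override
    public LabeledRecord parse(String line) {
        int hash = line.indexOf('#');
        String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
        String[] tokens = body.split("\\s+");
        Map<String, String> features = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("expected index:value but got " + tokens[i]);
            }
            // Feature indices are kept as strings so both formats share one record type.
            features.put(tokens[i].substring(0, colon), tokens[i].substring(colon + 1));
        }
        return new LabeledRecord(tokens[0], features);
    }
}
```

Downstream code would then only see LabeledRecord, so a learner would not
need to know which file format the data came from.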

- conversion to vectors

-- SGD introduced a form of random projection

-- Naive Bayes has some dictionary-based conversions

-- Other stuff does this or that

This should be subsumed into the AttributeFilters that I mentioned above.
We really just need random projection and Salton-style vector space models.
Clearly, we should allow direct input of vectors as well in case the user
is producing them for us.
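
As a rough illustration of the two conversions, the sketch below puts a
dictionary-based vectorizer and a hashed one behind a single interface.  The
names (Vectorizer, DictionaryVectorizer, HashedVectorizer) are hypothetical.
The hashed version uses signed feature hashing as a simple stand-in for
random projection; it is an assumption that this resembles what the SGD code
does.  Plain double[] arrays are used instead of Mahout's vector classes to
keep the example self-contained.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical contract: a list of tokens in, a fixed-length vector out. */
interface Vectorizer {
    double[] vectorize(List<String> tokens);
}

/** Dictionary-based (vector space model): each known term owns one dimension holding its count. */
final class DictionaryVectorizer implements Vectorizer {
    private final Map<String, Integer> index = new HashMap<>();

    DictionaryVectorizer(List<String> vocabulary) {
        for (String term : vocabulary) {
            if (!index.containsKey(term)) {
                index.put(term, index.size());
            }
        }
    }

    @Override
    public double[] vectorize(List<String> tokens) {
        double[] v = new double[index.size()];
        for (String t : tokens) {
            Integer i = index.get(t);
            if (i != null) {
                v[i] += 1.0;  // out-of-vocabulary terms are silently dropped
            }
        }
        return v;
    }
}

/** Hashed projection: no dictionary; each term is hashed to a slot with a pseudo-random sign. */
final class HashedVectorizer implements Vectorizer {
    private final int dimensions;

    HashedVectorizer(int dimensions) {
        this.dimensions = dimensions;
    }

    @Override
    public double[] vectorize(List<String> tokens) {
        double[] v = new double[dimensions];
        for (String t : tokens) {
            int h = t.hashCode();
            int slot = Math.floorMod(h, dimensions);
            // Take the sign from the top bit of a remixed hash so it is not tied to the slot.
            double sign = ((h * 0x9E3779B9) >>> 31) == 0 ? 1.0 : -1.0;
            v[slot] += sign;
        }
        return v;
    }
}
```

The dictionary version needs a vocabulary pass over the data first; the
hashed version does not, at the cost of occasional collisions between terms.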

- command line option processing

We really need a simple way to integrate all of the input processing
options into new and old code.
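
One possible shape for this, sketched with only the JDK: a shared holder
that pulls out the input-related options and hands everything else back to
the individual trainer.  The class name InputOptions and the flag names
(--input, --format, --vectorizer) are made up for illustration and do not
correspond to existing Mahout options.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical shared holder for input-related options; flag names are illustrative only. */
final class InputOptions {
    final String input;
    final String format;
    final String vectorizer;

    private InputOptions(String input, String format, String vectorizer) {
        this.input = input;
        this.format = format;
        this.vectorizer = vectorizer;
    }

    /**
     * Consumes the "--name value" pairs it recognizes and appends every other
     * argument, in order, to remaining so the caller can parse its own options.
     */
    static InputOptions parse(String[] args, List<String> remaining) {
        Map<String, String> seen = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            String a = args[i];
            boolean known = a.equals("--input") || a.equals("--format") || a.equals("--vectorizer");
            if (known) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("missing value for " + a);
                }
                seen.put(a, args[++i]);
            } else {
                remaining.add(a);
            }
        }
        if (!seen.containsKey("--input")) {
            throw new IllegalArgumentException("--input is required");
        }
        // Defaults here are arbitrary choices for the sketch.
        return new InputOptions(
                seen.get("--input"),
                seen.getOrDefault("--format", "csv"),
                seen.getOrDefault("--vectorizer", "dictionary"));
    }
}
```

A trainer's main method would call InputOptions.parse first and then apply
its own option handling to whatever is left in the remaining list.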

- model storage

It would be lovely if we could instantiate a model from a stored form
without even knowing what kind of learning produced the model.  All of the
classifiers and clustering algorithms should put out something that can be
instantiated this way.  I used Gson in the SGD code and found it pretty
congenial, but I didn't encode the class of the classifier, nor did I
provide a classifier abstract class.  I don't know what k-means or Canopy
clustering produce, nor random forests or Naive Bayes, but I am sure that
all of them are highly specific to the particular kind of model.

I don't know what is best here, but we definitely need something more common
than what we have.
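
To illustrate what encoding the class of the classifier might look like,
here is a small sketch that writes the concrete class name ahead of a
model-specific payload and uses reflection to restore it.  AbstractModel,
LinearModel and ModelStore are hypothetical names, and the comma-separated
payload is only a placeholder for whatever encoding (Gson or otherwise) a
real model would use.

```java
import java.lang.reflect.Constructor;

/** Hypothetical common base type for anything that can score a vector. */
abstract class AbstractModel {
    abstract double score(double[] x);

    /** Model-specific state, encoded however the model likes. */
    abstract String encode();
}

/** Toy model used only to exercise the store: a dot product with fixed weights. */
final class LinearModel extends AbstractModel {
    private final double[] weights;

    LinearModel(double[] weights) {
        this.weights = weights.clone();
    }

    /** Restoring constructor; ModelStore looks this up reflectively. */
    LinearModel(String encoded) {
        String[] parts = encoded.split(",");
        weights = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            weights[i] = Double.parseDouble(parts[i]);
        }
    }

    @Override
    double score(double[] x) {
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    @Override
    String encode() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < weights.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(weights[i]);
        }
        return sb.toString();
    }
}

/** Stored form: first line is the concrete class name, the rest is the payload. */
final class ModelStore {
    static String save(AbstractModel model) {
        return model.getClass().getName() + "\n" + model.encode();
    }

    static AbstractModel load(String stored) {
        int newline = stored.indexOf('\n');
        String className = stored.substring(0, newline);
        String payload = stored.substring(newline + 1);
        try {
            Class<? extends AbstractModel> type =
                    Class.forName(className).asSubclass(AbstractModel.class);
            Constructor<? extends AbstractModel> ctor = type.getDeclaredConstructor(String.class);
            return ctor.newInstance(payload);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot restore model of type " + className, e);
        }
    }
}
```

The caller of ModelStore.load never names LinearModel, which is the property
asked for above.  A real version would presumably want to restrict which
class names it is willing to load.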

What do others think?
