mahout-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: KNN Classifier project
Date Sun, 27 Mar 2011 00:20:44 GMT
On Sat, Mar 26, 2011 at 4:29 PM, Daniel McEnnis <dmcennis@gmail.com> wrote:

> Ted,
>
> Please keep in mind, I just downloaded the Mahout code today.  My
> knowledge is from a single presentation and Cloudera's Hadoop
> tutorial.


That shouldn't be a problem.  There are many who can help you state things
concisely in fashionable terminology.


> My goal is to have two stages - training takes a sequence
> of vectors and classifications and creates a large hdfs file of the
> form vector-classification.


"training" is a big word here.  Reformatting might be as accurate since no
transformation to speak of occurs.
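
For concreteness, the whole "training" step might be no more than something
like this (untested sketch; the path and label are made up, but
VectorWritable and SequenceFile are the real Mahout/Hadoop classes):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.VectorWritable;

    // "Training": write one (vector, classification) record per example
    // into a SequenceFile on HDFS.  No model fitting happens here.
    public class WriteTrainingData {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("knn/training.seq");        // illustrative path
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, VectorWritable.class, Text.class);
        try {
          // in practice this loop would run over the real training set
          writer.append(
              new VectorWritable(new DenseVector(new double[] {1.0, 2.0, 3.0})),
              new Text("someClass"));
        } finally {
          writer.close();
        }
      }
    }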


> This file is streamed to each node
> classifying an incoming set of vectors.


The key here is that the training data are streamed to each node.

In map-reduce, there is typically an asymmetry between the function
represented by the map and the sequence of values to which the map function
is applied.  Often, the map function is parametrized by some second kind of
data that is distinct from the sequence of values presented to it.  In some
computations, there are two kinds of inputs, and we must decide which
represents the parameters to the map function and which represents the
streaming input.
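
In Hadoop terms, the parameters usually travel to every node via the
distributed cache and get loaded once in setup(), while the streamed input
arrives record by record in map().  A skeleton, roughly (class and field
names are mine):

    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.math.VectorWritable;

    // The map function is parametrized by side data loaded once in
    // setup(); the other, larger input streams through map().
    public class ParametrizedMapper
        extends Mapper<VectorWritable, Text, Text, Text> {

      private Path[] parameterFiles;

      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException {
        // the "parameter" input: small enough to load on every node
        parameterFiles = DistributedCache.getLocalCacheFiles(
            context.getConfiguration());
        // ... read and index the parameter vectors from these files ...
      }

      @Override
      protected void map(VectorWritable vector, Text label, Context context)
          throws IOException, InterruptedException {
        // the "streaming" input: one record at a time, compared
        // against the parameters loaded in setup()
      }
    }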

Commonly, the larger of the two is best streamed and the smaller best serves
as parameters.  It is also helpful to consider which of the two kinds of
input can be split most easily.  In your case, the larger will commonly be
the input data, and both kinds can be split: the test cases split trivially,
and the training data can be split if you allow the k-best results from each
split to be merged later in a reduce function.
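
That merge in the reduce function might look roughly like this (untested;
the "distance\tlabel" value encoding and the class name are inventions of
mine, not anything in Mahout):

    import java.io.IOException;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Merge the per-split k-best candidates for one test vector into a
    // global k-best list, then take the majority vote over its labels.
    public class KBestMergeReducer extends Reducer<Text, Text, Text, Text> {

      private static final int K = 5;   // illustrative k

      private static double distanceOf(String candidate) {
        return Double.parseDouble(candidate.split("\t")[0]);
      }

      @Override
      protected void reduce(Text testId, Iterable<Text> candidates,
                            Context context)
          throws IOException, InterruptedException {
        // worst-first queue so poll() always evicts the current worst
        PriorityQueue<String> best = new PriorityQueue<String>(K + 1,
            new Comparator<String>() {
              public int compare(String a, String b) {
                return Double.compare(distanceOf(b), distanceOf(a));
              }
            });
        for (Text candidate : candidates) {   // each is "distance\tlabel"
          best.add(candidate.toString());
          if (best.size() > K) {
            best.poll();
          }
        }
        // majority wins among the surviving k labels
        Map<String, Integer> votes = new HashMap<String, Integer>();
        String winner = null;
        for (String candidate : best) {
          String label = candidate.split("\t")[1];
          Integer n = votes.get(label);
          votes.put(label, n = (n == null ? 1 : n + 1));
          if (winner == null || n > votes.get(winner)) {
            winner = label;
          }
        }
        context.write(testId, new Text(winner));
      }
    }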


> Each vector is compared
> against the vector to be classified and the table of k best matches is
> created from this.  Majority wins, resulting in key-classification or
> classification-key output.  With streaming of the training file, only
> k+2 vectors are needed in memory, achieving O(1) memory use and
> embarrassingly parallel execution.
>

Almost true.  Each mapper only needs the k best elements seen so far, plus
the latest training example, plus the vector being classified.  But two
things increase this slightly.

- First, since each mapper only sees a subset of all input, the output from
each mapper must be combined.  This typically requires 2k vectors in the
reducer.

- Second, since it is very expensive to run an entire map-reduce program, it
is nice to classify many vectors at once.  If we classify p vectors in a
single execution, then we need p(k+1) + 1 vectors in the mapper and 2k
vectors in the reducer (a sketch of this working set follows the list).
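
To make the arithmetic concrete: with p = 1000 and k = 10, the mapper holds
1000 * (10 + 1) + 1 = 11,001 vectors.  The working set might be organized
like this (untested sketch, names are mine):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    import org.apache.mahout.math.Vector;

    // The mapper's working set: p test vectors, each with a bounded
    // queue of its k nearest training examples seen so far, plus the
    // one training vector currently streaming through -- p(k+1) + 1.
    public class KnnWorkingSet {

      private static final int K = 10;   // illustrative k

      private static class Neighbor {
        final double distance;
        final String label;
        Neighbor(double distance, String label) {
          this.distance = distance;
          this.label = label;
        }
      }

      private final List<Vector> testVectors = new ArrayList<Vector>();
      private final List<PriorityQueue<Neighbor>> kBest =
          new ArrayList<PriorityQueue<Neighbor>>();

      public void addTestVector(Vector v) {
        testVectors.add(v);
        // worst-first ordering so poll() evicts the farthest neighbor
        kBest.add(new PriorityQueue<Neighbor>(K + 1,
            new Comparator<Neighbor>() {
              public int compare(Neighbor a, Neighbor b) {
                return Double.compare(b.distance, a.distance);
              }
            }));
      }

      // called once per streaming training example
      public void observe(Vector trainingVector, String label) {
        for (int i = 0; i < testVectors.size(); i++) {
          double d = testVectors.get(i).getDistanceSquared(trainingVector);
          PriorityQueue<Neighbor> queue = kBest.get(i);
          queue.add(new Neighbor(d, label));
          if (queue.size() > K) {
            queue.poll();
          }
        }
      }
    }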

If p is very large, then we might like to read batches of training vectors
and then explicitly iterate through batches of test vectors in the mapper.
The iteration can also be moved outside the map-reduce program entirely, so
that we run the map-reduce program several times with p small enough in each
run for memory to suffice.
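
In the second style, the driver is just a loop; the batch count, paths, and
job wiring below are all placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Run the same job once per batch of p test vectors so that each
    // run's p(k+1) + 1 working set fits in the mapper's memory.
    public class BatchedKnnDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int numBatches = 10;                         // illustrative
        for (int batch = 0; batch < numBatches; batch++) {
          Job job = new Job(conf, "knn-batch-" + batch);
          // ... set mapper/reducer classes and push this batch's test
          // vectors to the distributed cache here ...
          FileInputFormat.addInputPath(job, new Path("knn/training.seq"));
          FileOutputFormat.setOutputPath(job, new Path("knn/out-" + batch));
          if (!job.waitForCompletion(true)) {
            System.exit(1);
          }
        }
      }
    }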
