mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: distributed RandomSampler job?
Date Tue, 09 Aug 2011 15:50:57 GMT
Well, yes, that does sound like what I said.

But in that case, the mapper should just pass all of the data on to the
reducer.

You are limited to sample sizes that are at most the size of your data.  And I
certainly phrased it in a way that implied that your sample has to fit into
memory.  There are out-of-core reservoir samplers, but I can't remember the
last time I needed such a thing.
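
For concreteness, here is a minimal sketch of the in-memory reservoir sampler
being discussed (plain Java; the class name is illustrative, not Mahout's
implementation).  Inside a Hadoop mapper the retained samples would typically
be emitted as a group from cleanup() at the end of the split.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /** Minimal in-memory reservoir sampler: keeps a uniform sample of up to k items. */
    public class Reservoir<T> {
      private final int k;
      private final List<T> samples;
      private final Random rand = new Random();  // seed independently in each mapper
      private long seen = 0;

      public Reservoir(int k) {
        this.k = k;
        this.samples = new ArrayList<T>(k);
      }

      public void add(T item) {
        seen++;
        if (samples.size() < k) {
          samples.add(item);                     // fill the reservoir first
        } else {
          long j = (long) (rand.nextDouble() * seen);
          if (j < k) {
            samples.set((int) j, item);          // replace with probability k / seen
          }
        }
      }

      public List<T> getSamples() {
        return samples;
      }

      public long getCount() {
        return seen;                             // the reducer needs this for a fair merge
      }
    }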

Keep in mind also the possibility of using a percentage-based sampler.
Those require no extra memory and would only require a reducer if you want
fewer output files.
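
As a hedged sketch of what such a percentage-based sampler looks like (the
class, method, and 1% rate below are made up for illustration), the filter
keeps no state beyond a Random instance:

    import java.util.List;
    import java.util.Random;

    /** Stateless percentage-based sampler: no extra memory, and no reducer is
     *  needed unless you want to consolidate the output files. */
    public class PercentSampler {
      public static <T> void sample(List<T> input, double rate, List<T> output, Random rand) {
        for (T record : input) {
          if (rand.nextDouble() < rate) {   // keep each record with probability 'rate'
            output.add(record);             // in a real mapper this would be context.write(...)
          }
        }
      }
    }

A call such as sample(records, 0.01, kept, new Random()) keeps roughly 1% of
the input; the exact output size varies from run to run.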

On Tue, Aug 9, 2011 at 8:44 AM, Timothy Potter <thelabdude@gmail.com> wrote:

> Hi Ted,
>
> Can you clarify your point about "each mapper needs to retain as many
> samples as are desired in the end"? Does this mean I'm restricted to sample
> sizes based on the max number of key/value pairs in a split? From what I've
> read in the Hadoop docs, the number of map tasks for a job is determined by
> the number of splits, with mapred.map.tasks being only a hint to Hadoop ...
>
> Tim
>
>
> On Tue, Aug 9, 2011 at 12:49 AM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > On Mon, Aug 8, 2011 at 10:46 PM, Lance Norskog <goksron@gmail.com>
> wrote:
> >
> > > Do the parallel sampler mappers need to be deterministic? That is, do
> > > they all start with the same random seed?
> > >
> >
> > No.  Just the opposite.  They need to be independent.
> >
> >
> > > Can the mapper generate a high-quality hash of each vector, and throw
> > > away a part of the output space?
> >
> >
> > No.  Each sample is a vector which must be accepted or rejected.  If
> > accepted, then it is kept until the end of the split and then sent in a
> > group to the reducer.
> >
> >
> > > This would serve as a first cut in
> > > the mapper. Using the hash (or part of the hash) as the key for the
> > > remaining values allows tuning the number of keys vs. how many
> > > samples a reducer receives.
> > >
> >
> > Sort of.  To be fair, each mapper has to retain as many samples as are
> > desired in the end.  Then the reducer has to take a fair sample of all of
> > the groups that it receives, accounting for the fact that each group is
> > from a (potentially) different-sized input stream.
> >
>
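
For completeness, a minimal sketch of the reducer-side merge described in the
quoted reply above: each group is one mapper's reservoir plus a count of the
records that mapper saw, and each final sample is drawn from a group with
probability proportional to that group's remaining stream size.  The Group
class and the names below are invented for illustration; this is one way to
do the weighting, not necessarily what Mahout ships.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class FairMerge {

      /** One mapper's reservoir plus the number of records it was drawn from. */
      public static class Group<T> {
        final List<T> samples;   // up to k retained samples
        long streamSize;         // how many records the mapper actually saw

        public Group(List<T> samples, long streamSize) {
          this.samples = new ArrayList<T>(samples);
          this.streamSize = streamSize;
        }
      }

      /** Draws k items, picking each from a group with probability proportional
       *  to that group's remaining stream size, without replacement. */
      public static <T> List<T> merge(List<Group<T>> groups, int k, Random rand) {
        long total = 0;
        for (Group<T> g : groups) {
          total += g.streamSize;
        }
        List<T> result = new ArrayList<T>(k);
        while (result.size() < k && total > 0) {
          long pick = (long) (rand.nextDouble() * total);
          for (Group<T> g : groups) {
            if (pick < g.streamSize) {
              if (!g.samples.isEmpty()) {        // defensive; should not run dry first
                result.add(g.samples.remove(rand.nextInt(g.samples.size())));
              }
              g.streamSize--;                    // the group now stands for one fewer record
              total--;
              break;
            }
            pick -= g.streamSize;
          }
        }
        return result;
      }
    }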
