crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-178) Add library functions for performing distributed reservoir sampling
Date Wed, 13 Mar 2013 08:06:13 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-178:
------------------------------

    Attachment: CRUNCH-178d.patch

I think that we want to distinguish between "seed not given" (and hence null-valued) and "seed
= 0" in this context. We're making some compromises in the Sample.sample method to ensure
that we have a consistent view of the backing dataset, e.g., if we have:

PCollection<T> input = ...;
PCollection<T> sampled = Sample.sample(input, 0.05);

...then we expect the "sampled" PCollection to have the same contents no matter
when we run a MapReduce over it. This requires that we create a seed at the time that Sample.sample
is called. The catch is that the sample we create won't be truly random: since
all of the partitions use the same seed, they'll all generate the same sequence of random
numbers, which means that we'll see the same "slice" of each partition of the data. That said,
I believe that this lack of randomness is necessary here to preserve the idea that a PCollection
is truly immutable. If it ever became a real issue, we could do something fancier here, like
adding a salt based on the task ID.
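
To make the same-seed effect concrete, here is a minimal standalone sketch in plain Java. It
is not the Crunch code or the patch: the class name, the keptIndices helper, and the saltedSeed
mixing function are all hypothetical, and the salt shown is just one possible way to fold a
task ID into a shared seed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SeededSampleSketch {

  /** Positions in [0, n) that a Bernoulli sampler seeded with `seed` would keep. */
  static List<Integer> keptIndices(long seed, int n, double probability) {
    Random rng = new Random(seed);
    List<Integer> kept = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      if (rng.nextDouble() < probability) {
        kept.add(i);
      }
    }
    return kept;
  }

  /**
   * Hypothetical salt: mixes the task ID into the shared seed. The multiplier is odd,
   * so distinct task IDs give distinct values in the low 48 bits that java.util.Random uses.
   */
  static long saltedSeed(long seed, int taskId) {
    return seed ^ (0x9E3779B97F4A7C15L * (taskId + 1));
  }

  public static void main(String[] args) {
    long seed = 12345L; // chosen once, when the sample is defined

    // Two partitions sharing one seed keep exactly the same positions.
    List<Integer> partitionA = keptIndices(seed, 1000, 0.05);
    List<Integer> partitionB = keptIndices(seed, 1000, 0.05);
    System.out.println("same seed, same slice: " + partitionA.equals(partitionB));

    // Salting by task ID gives each partition its own sequence, and re-running
    // a task still reproduces that task's sample.
    List<Integer> task0 = keptIndices(saltedSeed(seed, 0), 1000, 0.05);
    List<Integer> task1 = keptIndices(saltedSeed(seed, 1), 1000, 0.05);
    System.out.println("task 0 kept " + task0.size() + ", task 1 kept " + task1.size());
  }
}
```

The salted version stays deterministic only if the task ID assigned to a given partition is
stable across runs, which is an assumption of this sketch.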

In the reservoir sampling case, we don't have this restriction: reservoir sampling kicks off
a MR job, so the PCollection that is returned will be materialized on disk somewhere, and
so the view of it will already be immutable. Therefore, we are free to be "more" random here,
and use a different Random instance (with a different seed) for each of the partitions of the
data.
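
For readers who haven't seen the technique, the per-partition step might look roughly like
the sketch below. This is the classic unweighted reservoir algorithm (Algorithm R), not
necessarily what the attached patch implements, and the class and method names are
hypothetical. The caller passes in the Random, so each partition can supply its own instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSketch {

  /** Keeps a uniform random sample of at most k items from a single pass over `items`. */
  static <T> List<T> reservoirSample(Iterable<T> items, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    long seen = 0;
    for (T item : items) {
      seen++;
      if (reservoir.size() < k) {
        // Fill the reservoir with the first k items.
        reservoir.add(item);
      } else {
        // Pick a slot in [0, seen); the item is kept with probability k / seen.
        long slot = (long) (rng.nextDouble() * seen);
        if (slot < k) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> partition = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      partition.add(i);
    }
    // new Random() picks a fresh seed per instance, so each partition gets its own sequence.
    System.out.println(reservoirSample(partition, 5, new Random()));
    System.out.println(reservoirSample(partition, 5, new Random()));
  }
}
```

Because the result of the job is written out once, using unseeded Random instances here
doesn't change what later readers of the output see.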

Aside from that, I javadoc'd everything properly in the attached patch, mostly via copy and
paste, and fixed the '<' characters. Don't apologize for nits; it's the only way we're ever
going to get this stuff cleaned up.
                
> Add library functions for performing distributed reservoir sampling
> -------------------------------------------------------------------
>
>                 Key: CRUNCH-178
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-178
>             Project: Crunch
>          Issue Type: Improvement
>          Components: MapReduce Patterns
>            Reporter: Josh Wills
>         Attachments: CRUNCH-178b.patch, CRUNCH-178c.patch, CRUNCH-178d.patch, CRUNCH-178.patch
>
>
> For a project I've been working on, I wrote up some Crunch functions for performing reservoir
> sampling and weighted reservoir sampling that I think would be useful enough to put in lib.*
> Here's the paper that I used as a reference for the implementations I wrote:
> http://arxiv.org/pdf/1012.0256.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
