datafu-dev mailing list archives

From "OlgaK (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number
Date Mon, 20 Nov 2017 21:03:00 GMT

    [ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259835#comment-16259835 ]

OlgaK commented on DATAFU-63:
-----------------------------

Thanks for the explanations, Eyal.
I think the gradlew script can be "just used" only on systems much like the one it was
generated on; as soon as the target system differs, I expect it to stop working. For example,
while Ubuntu and Fedora are both Linux, they are, so to speak, different breeds, and therefore
the script has to change. I think moving it out of the git repo makes sense.
To me, AlgebraicEvalFunc (used in SimpleRandomSample) and AccumulatorEvalFunc (used in ReservoirSample)
look substantially different. It is pointed out that ReservoirSample doesn't scale, but that is
because it extends AccumulatorEvalFunc and is a singleton; in other words, it is built to be
unscalable, and this is very confusing. On the conceptual level there is no difference between
`p` (a float, the fraction of the set) and `k` (an int, the size of the requested sample). What I
suggest is to make one module, an extension of AlgebraicEvalFunc, which will process both the `p`
and `k` sample-size parameters. They are easily convertible into one another; the only difference
is that `p` doesn't change when we cut the original set into chunks while `k` does, but that
doesn't mean the same chunking cannot be used.
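To make the `p`/`k` convertibility concrete, here is a minimal sketch. This is my own illustration, not DataFu code; the class and method names are made up:

```java
// Illustration only: relating a sample fraction p to a fixed sample
// size k, and adjusting k per chunk while p stays the same.
public class SampleSizeConversion {
    // The fraction p that yields an expected k items out of n in total.
    static double fractionForCount(long k, long n) {
        return (double) k / n;
    }

    // Per-chunk quota when k is split across m equal chunks;
    // p needs no such adjustment, since it applies to any subset as-is.
    static long countPerChunk(long k, int m) {
        return (long) Math.ceil((double) k / m);
    }

    public static void main(String[] args) {
        System.out.println(fractionForCount(5000, 1000000)); // 0.005
        System.out.println(countPerChunk(5000, 10));         // 500
    }
}
```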
Regarding the implementation following the paper:
for small `p` (or `k`) the output of `Initial` will be far bigger than needed, since the run
goes through the entire set (line 270) without a `k` or `p` constraint, while for `p`/`k`
close to the set size I doubt you'll get 99% of the requested result; in that case it should
be exclusion from the original set rather than addition to an empty one. Overall, uniform random
sampling is very simple and doesn't require that much sophistication; the task gets trickier
when one leaves the uniform distribution and tries to fit a curve (getQ1 and getQ2). In that
case you indeed cannot guarantee (rather, the chances are miserable) that the `k` elements taken
will give someone the desired distribution. I'd suggest making the simple stuff simple. One can
get a guaranteed number or fraction of elements from a set when selection is uniform; in that
case one doesn't need to go through the entire set in the best case, no extra calculations are
required, and the memory requirements are decent (ceil(p*n) or k/number_of_chunks).
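As a sketch of how a guaranteed-count uniform sample can be drawn in one pass with O(k) memory: this is classic selection sampling (Knuth's Algorithm S), shown here as an illustration, not the DataFu implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Selection sampling (Knuth's Algorithm S): returns exactly k items,
// every k-subset equally likely, using O(k) extra memory; it can also
// stop early once the k-th item has been taken.
public class SelectionSample {
    static <T> List<T> sample(List<T> items, int k, Random rng) {
        List<T> out = new ArrayList<>(k);
        long remaining = items.size();
        long needed = k;
        for (T item : items) {
            if (needed == 0) break; // best case: no full pass needed
            // take this item with probability needed / remaining
            if (rng.nextDouble() * remaining < needed) {
                out.add(item);
                needed--;
            }
            remaining--;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) data.add(i);
        // exactly 50 items, regardless of the random draws
        System.out.println(sample(data, 50, new Random(42)).size());
    }
}
```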
I'll try to make the implementation as I see it; hopefully it will be clearer in code than
in words. I just need to figure out how to adjust `k` if I can't store it in the parent, and
to resolve the gradle version dependency. This should give better performance in all respects.

HTH

> SimpleRandomSample by a fixed number
> ------------------------------------
>
>                 Key: DATAFU-63
>                 URL: https://issues.apache.org/jira/browse/DATAFU-63
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: jian wang
>            Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability; it does not support
> sampling a fixed number of items. ReservoirSample may do the work, but since it relies
> on an in-memory priority queue, memory issues may happen if we are going to sample a huge
> number of items, e.g. sample 100M from 100G of data.
> Suggested approach is to create a new class "SimpleRandomSampleByCount" that uses Manuver's
> rejection threshold to reject items whose weight exceeds the threshold as we go from mapper
> to combiner to reducer. The major part of the algorithm will be very similar to SimpleRandomSample,
> except that we do not use Bernstein's theory to accept items, and we replace the probability
> with p = k / n, where k is the number of items to sample and n is the total number of items
> seen locally in the mapper, combiner and reducer.
> Quote this requirement from others:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a specified
> number of rows from grouped data? I’m currently doing this, since it appears that the SAMPLE
> operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>   rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>   ordered_rnds = ORDER rnds BY rnd;
>   limitSet = LIMIT ordered_rnds 5000;
>   GENERATE group AS farm,
>            FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the ORDER/LIMIT
> isn’t great for performance). Is there a less awkward way to do this?
> Thanks,
> "
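The rejection-threshold idea from the issue description above can be sketched as follows. This is my own illustration of the general "keep the k smallest random weights" technique, not the actual patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Keep the k items with the smallest random weights. The largest weight
// currently held is the rejection threshold: any heavier item is
// dropped immediately. Two size-k summaries merge the same way, so each
// of mapper, combiner and reducer holds at most k items in memory.
public class ThresholdSample {
    static List<Integer> sample(int[] items, int k, Random rng) {
        // max-heap on weight; heap.peek()[0] is the current threshold
        PriorityQueue<double[]> heap =
            new PriorityQueue<>(k, (a, b) -> Double.compare(b[0], a[0]));
        for (int v : items) {
            double w = rng.nextDouble();
            if (heap.size() < k) {
                heap.add(new double[] {w, v});
            } else if (w < heap.peek()[0]) { // below threshold: accept
                heap.poll();                 // evict the heaviest item
                heap.add(new double[] {w, v});
            }                                // otherwise: reject
        }
        List<Integer> out = new ArrayList<>(heap.size());
        for (double[] e : heap) out.add((int) e[1]);
        return out;
    }

    public static void main(String[] args) {
        int[] data = new int[100000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        // at most k items are ever held, whatever the input size
        System.out.println(sample(data, 100, new Random(1)).size());
    }
}
```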



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
