datafu-dev mailing list archives

From "jian wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF
Date Sun, 09 Feb 2014 01:10:20 GMT

    [ https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895791#comment-13895791
] 

jian wang commented on DATAFU-16:
---------------------------------

Matt, do you think we should go ahead and implement the exponential jump only for the
accumulate-based model? For the algebraic model, we would still use weighted reservoir
sampling without the exponential jump.
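
To illustrate why the version without the jump suits the algebraic model, here is a minimal
Python sketch (not the attached Java code) of the key-based weighted reservoir scheme usually
attributed to Efraimidis and Spirakis. The names weighted_reservoir and merge_reservoirs are
hypothetical, and weights are assumed to be positive. Each item gets an independent key
u ** (1 / w), so partial reservoirs built on separate partitions can be merged by keeping the
k largest keys overall.

```python
import heapq
import random


def weighted_reservoir(items, k, rng):
    """Keep the k (item, weight) pairs with the largest key u ** (1 / w).

    Hypothetical helper for illustration; assumes every weight is > 0.
    """
    heap = []  # min-heap of (key, item, weight); heap[0] holds the smallest key
    for item, w in items:
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item, w))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item, w))
    return heap


def merge_reservoirs(partials, k):
    """Merge partial reservoirs: the global top-k keys all survive in some partial top-k."""
    return heapq.nlargest(k, (entry for part in partials for entry in part))


if __name__ == "__main__":
    data = [(i, 1.0 + i % 4) for i in range(20)]
    rng = random.Random(0)
    left = weighted_reservoir(data[:10], 3, rng)
    right = weighted_reservoir(data[10:], 3, rng)
    print(merge_reservoirs([left, right], 3))
```

The keys are drawn independently per item, so no state has to be carried from one partition
to another. That independence is what the exponential jump gives up.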

The upside of introducing the exponential jump: it could improve job performance, especially
when there is a lot of data to process, without sacrificing much sampling precision (the
per-item sampling probability stays close to w/sum(w)).
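
For reference, a minimal Python sketch of weighted reservoir sampling with exponential jumps
(the scheme usually called A-ExpJ), written for illustration and not taken from the attached
ScoredExpJmpReservoir.java. The function names are hypothetical, weights are assumed to be
positive and moderate in size, and the code is not hardened against floating-point edge cases.
Instead of drawing a random key for every item, it draws one random amount of weight to skip
and only touches the reservoir when that amount is used up.

```python
import heapq
import math
import random


def _open_unit(rng):
    """Uniform draw from the open interval (0, 1), so log() and key values stay finite."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def weighted_sample_exp_jump(items, k, rng):
    """Return up to k (key, item, weight) entries sampled with exponential jumps.

    Hypothetical helper for illustration; assumes k >= 1 and every weight is > 0.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    stream = iter(items)
    heap = []  # min-heap keyed on u ** (1 / w); heap[0] is the current threshold
    for item, w in stream:
        heapq.heappush(heap, (_open_unit(rng) ** (1.0 / w), item, w))
        if len(heap) == k:
            break
    if len(heap) < k:
        return heap  # fewer than k items in total: keep everything

    # Amount of weight to skip before the next replacement.
    skip = math.log(_open_unit(rng)) / math.log(heap[0][0])
    for item, w in stream:
        skip -= w
        if skip <= 0:
            # This item enters the reservoir; its key is drawn above the threshold.
            t = heap[0][0] ** w
            key = rng.uniform(t, 1.0) ** (1.0 / w)
            heapq.heapreplace(heap, (key, item, w))
            skip = math.log(_open_unit(rng)) / math.log(heap[0][0])
    return heap


if __name__ == "__main__":
    data = [("a", 1.0), ("b", 1.0), ("c", 8.0)]
    print(weighted_sample_exp_jump(data, 2, random.Random(0)))
```

The skip counter depends on everything seen so far in the stream, which is why this variant
fits a sequential, accumulate-style pass more naturally than independent partial reservoirs.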

The downside: the accumulate-based model is likely to be used less often than the algebraic
one, so is it worthwhile to introduce this enhancement?

> weighted reservoir sampling with exponential jumps UDF
> ------------------------------------------------------
>
>                 Key: DATAFU-16
>                 URL: https://issues.apache.org/jira/browse/DATAFU-16
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac, Linux
> pig-0.11
>            Reporter: jian wang
>            Priority: Minor
>         Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java, WeightedSamplingCorrectnessTests.java
>
>
> Create a weightedReservoirSampleWithExpJump UDF to implement the weighted reservoir sampling
> algorithm with exponential jumps. Investigation is tracked in https://github.com/linkedin/datafu/issues/80.
> This task is part of an experiment with different weighted sampling algorithms.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
