datafu-dev mailing list archives

From "jian wang (JIRA)" <>
Subject [jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF
Date Sun, 09 Feb 2014 01:10:20 GMT


jian wang commented on DATAFU-16:

Matt, do you think we should go ahead and implement the exponential jump only for the
accumulate-based model, and for the algebraic model keep using weighted reservoir sampling
without exponential jumps?

The upside of introducing the exponential jump: it could improve job performance, especially
when there is a lot of data to process, without sacrificing much sampling precision (the
per-item sampling probability stays close to w/sum(w)).
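
For readers unfamiliar with the technique, here is a minimal sketch of weighted reservoir
sampling with exponential jumps, following my understanding of the published A-ExpJ scheme
by Efraimidis and Spirakis. It is written in Python for brevity and is not the DataFu UDF
code; the function names and the log-domain key representation are assumptions made for
this illustration. The point it shows is that, once the reservoir is full, most items cost
only a subtraction and a comparison, with random draws made only when an item is replaced.

```python
import heapq
import math
import random


def _open_unit(rng):
    """Return a uniform draw from the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def _next_jump(rng, min_log_key):
    """Total weight to skip before the next replacement: log(r) / log(T)."""
    if min_log_key >= 0.0:  # threshold key is 1.0, so nothing can displace it
        return math.inf
    return math.log(_open_unit(rng)) / min_log_key


def weighted_reservoir_sample_exp_jump(items, k, rng=None):
    """Sample up to k values from an iterable of (value, weight) pairs.

    Keys are stored as log(u) / w, i.e. the log of the usual key u ** (1 / w),
    to avoid floating-point trouble with very large or very small weights.
    """
    rng = rng or random.Random()
    if k <= 0:
        return []
    heap = []   # min-heap of (log_key, sequence_number, value)
    jump = 0.0  # remaining weight to skip before the next replacement
    for seq, (value, weight) in enumerate(items):
        if weight <= 0:
            raise ValueError("weights must be positive")
        if len(heap) < k:
            # Fill phase: every item gets a key, as in the plain algorithm.
            heapq.heappush(heap, (math.log(_open_unit(rng)) / weight, seq, value))
            if len(heap) == k:
                jump = _next_jump(rng, heap[0][0])
            continue
        jump -= weight
        if jump > 0.0:
            continue  # skipped without drawing a random number
        # This item replaces the current minimum. Its key is drawn so that it
        # is guaranteed to be at least the threshold being replaced.
        t = math.exp(weight * heap[0][0])
        r = t + (1.0 - t) * _open_unit(rng)
        heapq.heapreplace(heap, (math.log(r) / weight, seq, value))
        jump = _next_jump(rng, heap[0][0])
    return [value for _, _, value in heap]
```

A call such as weighted_reservoir_sample_exp_jump([("a", 5.0), ("b", 1.0), ("c", 1.0)], 2)
returns two of the three values, with "a" favored. Note that the skip counter is state
carried from one item to the next, so the sketch assumes a single sequential pass over
the data.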

The downside: the accumulate-based model is likely to be used less often than the algebraic
one, so is it worthwhile to introduce this enhancement?
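
To illustrate why the version without jumps fits an algebraic (initial / intermediate /
final) evaluation, here is a rough Python sketch, not the DataFu implementation. Each
partition keeps the k items with the largest keys log(u)/w, and partial reservoirs are
combined by again keeping the k largest keys. The function names are hypothetical and
chosen only to mirror the three algebraic stages.

```python
import heapq
import math
import random


def partial_reservoir(items, k, rng):
    """Initial stage: key every (value, weight) pair and keep the k largest keys."""
    def keyed():
        for value, weight in items:
            u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
            yield (math.log(u) / weight, value)
    return heapq.nlargest(k, keyed(), key=lambda pair: pair[0])


def merge_reservoirs(partials, k):
    """Intermediate stage: the k largest keys across all partial reservoirs."""
    merged = (pair for part in partials for pair in part)
    return heapq.nlargest(k, merged, key=lambda pair: pair[0])


def final_sample(reservoir):
    """Final stage: drop the keys and return only the sampled values."""
    return [value for _, value in reservoir]
```

Because each key depends only on its own item, merging partial reservoirs yields the same
k items that a single pass over all the data with the same keys would have kept.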

> weighted reservoir sampling with exponential jumps UDF
> ------------------------------------------------------
>                 Key: DATAFU-16
>                 URL:
>             Project: DataFu
>          Issue Type: New Feature
>         Environment: Mac, Linux
> pig-0.11
>            Reporter: jian wang
>            Priority: Minor
>         Attachments:,,
> Create a weightedReservoirSampleWithExpJump UDF to implement the weighted reservoir sampling
algorithm with exponential jumps. Investigation is tracked in
This task is part of an experiment with different weighted sampling algorithms.

This message was sent by Atlassian JIRA
