crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Percentile rank
Date Thu, 02 Apr 2015 16:36:08 GMT
I can't think of a great way to do it-- knowing exactly which record you're
processing (in any kind of order) in a distributed processing job is always
somewhat fraught. Gun to my head, I would do it in two phases:

1) Get the name of the FileSplit for the current task-- which can be
retrieved, although we don't make it easy. You can do it via something like
this from inside of a map-side DoFn:

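// Crunch wraps the real split; the wrapper is expected to implement Supplier<InputSplit>
InputSplit split = ((MapContext) getContext()).getInputSplit();
// Unwrap to the underlying FileSplit (this cast assumes a file-based input)
FileSplit baseSplit = (FileSplit) ((Supplier<InputSplit>) split).get();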

Then count up the number of records inside of each FileSplit. I'm not sure
if you should disable combine files when you do this, but it seems like a
good idea.

2) Create a new DoFn that takes the output of the previous job and uses it
to determine the global position of the record currently being processed,
based on the sorted order of the FileSplit names and an internal counter
that gets reset to zero for each new FileSplit.
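
To make the arithmetic in step 2 concrete, here is a rough sketch in plain
Java with no Crunch or Hadoop dependencies. Everything in it is an assumption
for illustration: the class name SplitOffsets, the helpers startOffsets and
percentileRank, and the split names are all made up, and it assumes the
split names sort in the same order as the data (e.g. zero-padded part file
names, one split per file). Percentile rank is taken here to mean the
percentage of records at or before the current one, which is only one of
several common definitions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; not part of Crunch.
final class SplitOffsets {

    // Given record counts per split (the output of phase 1), return the
    // global index of the first record in each split, walking the splits
    // in sorted name order.
    public static Map<String, Long> startOffsets(Map<String, Long> countsBySplit) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        long running = 0L;
        for (Map.Entry<String, Long> e : new TreeMap<>(countsBySplit).entrySet()) {
            offsets.put(e.getKey(), running);
            running += e.getValue();
        }
        return offsets;
    }

    // Assumed definition: percent of records at or before this zero-based index.
    public static double percentileRank(long globalIndex, long total) {
        return 100.0 * (globalIndex + 1) / total;
    }

    public static void main(String[] args) {
        // Made-up counts standing in for the phase 1 output.
        Map<String, Long> counts = new TreeMap<>();
        counts.put("part-r-00000", 2L);
        counts.put("part-r-00001", 3L);

        Map<String, Long> offsets = startOffsets(counts);
        long total = 0L;
        for (long c : counts.values()) {
            total += c;
        }

        // Simulates what the phase 2 DoFn would do: a local counter that
        // restarts at zero for each split, added to that split's offset.
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            for (long local = 0; local < e.getValue(); local++) {
                long global = offsets.get(e.getKey()) + local;
                System.out.println(e.getKey() + " record " + local
                        + " -> global index " + global
                        + ", percentile rank " + percentileRank(global, total));
            }
        }
    }
}
```

In a real job the offsets map would be small enough to ship to every task
(one entry per split), and the local counter would live in the DoFn itself.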

J

On Thu, Apr 2, 2015 at 7:39 AM, André Pinto <andredasilvapinto@gmail.com>
wrote:

> Hi,
>
> I'm trying to calculate the percentile ranks for the values of a sorted
> PTable (i.e. at which % rank each element is within the whole data set). Is
> there a way to do this with Crunch? Seems that we would only need to have
> access to the global index of the record during an iteration over the data
> set.
>
> Thanks in advance,
> André
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
