crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Percentile rank
Date Tue, 07 Apr 2015 02:06:48 GMT
Hey Andre,

Not sure what you mean precisely-- do you mean an option or method in the
Sort API that would include the rank of each item?

In general, I like to avoid API methods that assume one reducer can handle
all of the data in a PCollection, which I think is what you're suggesting
(i.e., just stream all of the data in sorted order to a single reducer).

J

On Mon, Apr 6, 2015 at 3:19 PM, André Pinto <andredasilvapinto@gmail.com>
wrote:

> Hi Josh,
>
> Thanks for replying.
>
> That really sounds very hacky. I was expecting something with a little
> more support from the API.
>
> I guess we could also use sortAndApply with a randomly generated singleton
> Key for the entire set of values and then use the Iterable on the Values to
> obtain the sorted index. It still looks bad though...
>
> Just out of curiosity, why isn't the Iterable approach also supported on
> the simple Sort.sort? Sorry if this looks obvious to you, but I'm still new
> to Crunch and Hadoop.
>
> Thanks.
>
> On Thu, Apr 2, 2015 at 6:36 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> I can't think of a great way to do it-- knowing exactly which record
>> you're processing (in any kind of order) in a distributed processing job is
>> always somewhat fraught. Gun to my head, I would do it in two phases:
>>
>> 1) Get the name of the FileSplit for the current task-- which can be
>> retrieved, although we don't make it easy. You can do it via something like
>> this from inside of a map-side DoFn:
>>
>> InputSplit split = ((MapContext) getContext()).getInputSplit();
>> FileSplit baseSplit = (FileSplit) ((Supplier<InputSplit>) split).get();
>>
>> Then count up the number of records inside of each FileSplit. I'm not sure
>> if you should disable combine files when you do this, but it seems like a
>> good idea.
>>
>> 2) Create a new DoFn that takes the output of the previous job and uses
>> it to determine the global position of the record currently being
>> processed, based on the sorted order of the FileSplit names and an internal
>> counter that gets reset to zero for each new FileSplit.
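
The bookkeeping behind the two phases above can be sketched in plain Java, without any Crunch or Hadoop types. This is only an illustration under stated assumptions: the class `SplitOffsets` and its methods are hypothetical names, not part of the Crunch API; the per-split counts are assumed to have already been collected by phase 1; the lexicographic order of the split names (for example part-00000, part-00001) is assumed to match the global sort order; and the percentile rank is taken to be the share of records that come before the current one, which is one of several common conventions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Crunch.
final class SplitOffsets {

  // Input: record count per split name (the assumed output of phase 1).
  // Output: for each split, the number of records in all splits sorting before it.
  static Map<String, Long> startOffsets(Map<String, Long> countsBySplit) {
    Map<String, Long> offsets = new LinkedHashMap<>();
    long running = 0L;
    // A TreeMap iterates keys in sorted order, standing in for
    // "the sorted order of the FileSplit names".
    for (Map.Entry<String, Long> e : new TreeMap<>(countsBySplit).entrySet()) {
      offsets.put(e.getKey(), running);
      running += e.getValue();
    }
    return offsets;
  }

  // Phase 2: localIndex is the per-split counter that restarts at zero
  // whenever a new split begins.
  static long globalIndex(Map<String, Long> offsets, String split, long localIndex) {
    return offsets.get(split) + localIndex;
  }

  // Assumed convention: percentage of all records that precede this one.
  static double percentileRank(long globalIndex, long totalRecords) {
    return 100.0 * globalIndex / totalRecords;
  }

  public static void main(String[] args) {
    Map<String, Long> counts = new TreeMap<>();
    counts.put("part-00000", 3L);
    counts.put("part-00001", 5L);
    counts.put("part-00002", 2L);

    Map<String, Long> offsets = startOffsets(counts);
    long index = globalIndex(offsets, "part-00001", 2L);
    System.out.println(index + " " + percentileRank(index, 10L));
  }
}
```

In a real job the offsets map is small (one entry per split), so it could plausibly be shipped to every task, while the local counter would live inside the map-side DoFn described in step 2.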
>>
>> J
>>
>> On Thu, Apr 2, 2015 at 7:39 AM, André Pinto <andredasilvapinto@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to calculate the percentile ranks for the values of a sorted
>>> PTable (i.e. at which % rank each element is within the whole data set). Is
>>> there a way to do this with Crunch? Seems that we would only need to have
>>> access to the global index of the record during an iteration over the data
>>> set.
>>>
>>> Thanks in advance,
>>> André
>>>
>>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
