crunch-user mailing list archives

From André Pinto <andredasilvapi...@gmail.com>
Subject Re: Percentile rank
Date Tue, 07 Apr 2015 07:26:47 GMT
Hi Josh,

Yes. I guess the reasoning for not having the Iterable on Sort.sort, but
having it on the secondary sort, was to keep people from using it on the
complete data set (it is assumed that there will never be that many records
with the same key, so it is OK to iterate over those few records). Seems
reasonable.

Yes, using the Iterable on a single reducer is certainly not the best way
to do this, but considering that there is no (simple) access to the global
index, I think there is really no other way. At least iterating over the
Iterable will not load all the data into memory, right? It does lazy
loading, so it will just take a lot longer than doing it in parallel.
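To make the idea concrete, here is a minimal sketch (not actual Crunch API code) of how percentile ranks could be derived by streaming an already-sorted Iterable once with a running index, which is what a single lazy reducer would effectively do; the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PercentileRank {
    // Streams the already-sorted values once, tracking a running index;
    // only the current element is held in memory, mirroring a lazy
    // reducer Iterable. Rank is scaled to 0-100 over (total - 1) steps.
    static Map<Double, Double> ranks(Iterable<Double> sortedValues, long total) {
        Map<Double, Double> out = new LinkedHashMap<>();
        long index = 0;
        for (double v : sortedValues) {
            out.put(v, 100.0 * index / (total - 1));
            index++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> sorted = Arrays.asList(10.0, 20.0, 30.0, 40.0, 50.0);
        // 10.0 -> 0.0, 30.0 -> 50.0, 50.0 -> 100.0
        System.out.println(ranks(sorted, sorted.size()));
    }
}
```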

Thanks.

On Tue, Apr 7, 2015 at 4:06 AM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Andre,
>
> Not sure what you mean precisely-- do you mean an option or method in the
> Sort API that would include the rank of each item?
>
> In general, I like to avoid assuming that one reducer can handle all of
> the data in a PCollection on API methods, which I think is what you're
> saying (i.e., just stream all of the data in sorted order to a single
> reducer.)
>
> J
>
> On Mon, Apr 6, 2015 at 3:19 PM, André Pinto <andredasilvapinto@gmail.com>
> wrote:
>
>> Hi Josh,
>>
>> Thanks for replying.
>>
>> That really sounds very hacky. I was expecting something with a little
>> more support from the API.
>>
>> I guess we could also use sortAndApply with a random generated singleton
>> Key for the entire set of values and then use the Iterable on the Values to
>> obtain the sorted index. It still looks bad though...
>>
>> Just out of curiosity, why isn't the Iterable approach also supported on
>> the simple Sort.sort? Sorry if this looks obvious to you, but I'm still new
>> to Crunch and Hadoop.
>>
>> Thanks.
>>
>> On Thu, Apr 2, 2015 at 6:36 PM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> I can't think of a great way to do it-- knowing exactly which record
>>> you're processing (in any kind of order) in a distributed processing job is
>>> always somewhat fraught. Gun to my head, I would do it in two phases:
>>>
>>> 1) Get the name of the FileSplit for the current task-- which can be
>>> retrieved, although we don't make it easy. You can do it via something like
>>> this from inside of a map-side DoFn:
>>>
>>> InputSplit split = ((MapContext) getContext()).getInputSplit();
>>> FileSplit baseSplit = (FileSplit) ((Supplier<InputSplit>) split).get();
>>>
>>> Then count up the number of records inside each FileSplit. I'm not
>>> sure whether you should disable combining files when you do this, but it
>>> seems like a good idea.
>>>
>>> 2) Create a new DoFn that takes the output of the previous job and uses
>>> it to determine exactly which record in order the currently processing
>>> record is, based on the sorted order of the FileSplit names and an internal
>>> counter that gets reset to zero for each new FileSplit.
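A hedged sketch of the second phase described above: given the per-split record counts from phase 1, keyed by split name, the starting offset of each split in the global sorted order can be precomputed, so the global index of the i-th record in a split is simply offset + i. The names here are illustrative, not from the Crunch API:

```java
import java.util.Map;
import java.util.TreeMap;

public class GlobalIndex {
    // Walks the split names in sorted order and accumulates a running
    // total, yielding the global starting offset of each FileSplit.
    // A DoFn would add its internal per-split counter to this offset
    // to recover the global record index.
    static Map<String, Long> startOffsets(TreeMap<String, Long> countsBySplit) {
        Map<String, Long> offsets = new TreeMap<>();
        long running = 0;
        for (Map.Entry<String, Long> e : countsBySplit.entrySet()) {
            offsets.put(e.getKey(), running);
            running += e.getValue();
        }
        return offsets;
    }
}
```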
>>>
>>> J
>>>
>>> On Thu, Apr 2, 2015 at 7:39 AM, André Pinto <andredasilvapinto@gmail.com
>>> > wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to calculate the percentile ranks for the values of a sorted
>>>> PTable (i.e. at which % rank each element is within the whole data set).
>>>> Is there a way to do this with Crunch? It seems that we would only need
>>>> access to the global index of the record during an iteration over the
>>>> data set.
>>>>
>>>> Thanks in advance,
>>>> André
>>>>
>>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
