crunch-user mailing list archives

From David Ortiz <dpo5...@gmail.com>
Subject Re: Percentile rank
Date Tue, 07 Apr 2015 13:27:17 GMT
I lose at proofreading.  Had to completely rewrite a section of one of my
pipelines because of that issue.

On Tue, Apr 7, 2015 at 9:25 AM David Ortiz <dpo5003@gmail.com> wrote:

> That would be the expectation.  Depending on the number of records,
> though, it's possible to start getting OutOfMemoryErrors thrown by the
> Hadoop framework during the shuffle/sort phase.  I had to completely
> rewrite a section of one of my pipelines because that was happening once
> we ran it on production-level data.  Depending on what else you're running
> on the cluster, that particular issue will also be very disruptive to
> other jobs.
>
> On Tue, Apr 7, 2015 at 3:27 AM André Pinto <andredasilvapinto@gmail.com>
> wrote:
>
>> Hi Josh,
>>
>> Yes. I guess the reasoning for not having the Iterable on Sort.sort but
>> having it on the secondary sort was to avoid people using it on the
>> complete data set (and it is assumed that there will never be that many
>> records with the same key, so it will be OK to iterate over those few
>> records). Seems reasonable.
>>
>> Yes, using the Iterable on a single reducer is certainly not the best way
>> to do this, but considering that there is no (simple) access to the global
>> index, I think there is really no other way. At least iterating over the
>> Iterable will not move all the data into memory, right? It does lazy
>> loading, so it will just take a lot longer than doing it in parallel.
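(For what it's worth, the single-reducer counter idea can be sketched in plain Java, with no Crunch types at all -- `percentileRanks` is a made-up helper, and the up-front total count `n` is an assumption, e.g. supplied by a separate counting pass, since a lazy Iterable cannot tell you `n` before you finish iterating:)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PercentileRank {
    // Hypothetical sketch: values arrive already in sorted order (as they
    // would from the sort), and the total count n is known from a prior
    // pass. A single running counter then yields each value's rank.
    static List<double[]> percentileRanks(Iterable<Double> sortedValues, long n) {
        List<double[]> out = new ArrayList<>();
        long index = 0;  // running global index, incremented per record
        for (double v : sortedValues) {
            index++;                                  // 1-based rank
            out.add(new double[] { v, 100.0 * index / n });
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> sorted = Arrays.asList(10.0, 20.0, 30.0, 40.0);
        for (double[] r : percentileRanks(sorted, sorted.size())) {
            System.out.println(r[0] + " -> " + r[1] + "%");
        }
        // prints 10.0 -> 25.0%, 20.0 -> 50.0%, 30.0 -> 75.0%, 40.0 -> 100.0%
    }
}
```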
>>
>> Thanks.
>>
>> On Tue, Apr 7, 2015 at 4:06 AM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Hey Andre,
>>>
>>> Not sure what you mean precisely -- do you mean an option or method in
>>> the Sort API that would include the rank of each item?
>>>
>>> In general, I like to avoid API methods that assume one reducer can
>>> handle all of the data in a PCollection, which I think is what you're
>>> saying (i.e., just stream all of the data in sorted order to a single
>>> reducer).
>>>
>>> J
>>>
>>> On Mon, Apr 6, 2015 at 3:19 PM, André Pinto <andredasilvapinto@gmail.com
>>> > wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> Thanks for replying.
>>>>
>>>> That really sounds very hacky. I was expecting something with a little
>>>> more support from the API.
>>>>
>>>> I guess we could also use sortAndApply with a randomly generated
>>>> singleton key for the entire set of values and then use the Iterable on
>>>> the values to obtain the sorted index. It still looks bad though...
>>>>
>>>> Just out of curiosity, why isn't the Iterable approach also supported
>>>> on the simple Sort.sort? Sorry if this looks obvious to you, but I'm still
>>>> new to Crunch and Hadoop.
>>>>
>>>> Thanks.
>>>>
>>>> On Thu, Apr 2, 2015 at 6:36 PM, Josh Wills <jwills@cloudera.com> wrote:
>>>>
>>>>> I can't think of a great way to do it -- knowing exactly which record
>>>>> you're processing (in any kind of order) in a distributed processing job
>>>>> is always somewhat fraught. Gun to my head, I would do it in two phases:
>>>>>
>>>>> 1) Get the name of the FileSplit for the current task -- which can be
>>>>> retrieved, although we don't make it easy. You can do it via something
>>>>> like this from inside of a map-side DoFn:
>>>>>
>>>>> InputSplit split = ((MapContext) getContext()).getInputSplit();
>>>>> FileSplit baseSplit = (FileSplit) ((Supplier<InputSplit>) split).get();
>>>>>
>>>>> Then count up the number of records inside of each FileSplit. I'm not
>>>>> sure if you need to disable combining files when you do this, but it
>>>>> seems like a good idea.
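(A toy illustration of that tally, with each record reduced to just the name of the FileSplit it came from, as plain strings -- `countBySplit` is an invented helper for illustration, not a Crunch or Hadoop API:)

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SplitCounts {
    // Toy phase-1 sketch: one entry per record, holding the name of the
    // FileSplit that record was read from; tally records per split.
    static Map<String, Long> countBySplit(List<String> splitNamePerRecord) {
        Map<String, Long> counts = new TreeMap<>();  // sorted by split name
        for (String split : splitNamePerRecord) {
            counts.merge(split, 1L, Long::sum);      // increment the tally
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countBySplit(java.util.Arrays.asList(
                "part-00000", "part-00000", "part-00001", "part-00000")));
        // prints {part-00000=3, part-00001=1}
    }
}
```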
>>>>>
>>>>> 2) Create a new DoFn that takes the output of the previous job and
>>>>> uses it to determine the global position of the record currently being
>>>>> processed, based on the sorted order of the FileSplit names and an
>>>>> internal counter that gets reset to zero for each new FileSplit.
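(Again in toy form: given the phase-1 counts, the starting global offset of each split falls out of a running sum over the sorted split names; the per-split counter is then added to that offset. `startingOffsets` is illustrative, not part of Crunch:)

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class SplitOffsets {
    // Sorting the split names and accumulating their record counts gives
    // each split's starting global index; record i within a split has
    // global index offset + i.
    static Map<String, Long> startingOffsets(Map<String, Long> countsBySplit) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        long running = 0;
        for (Map.Entry<String, Long> e : new TreeMap<>(countsBySplit).entrySet()) {
            offsets.put(e.getKey(), running);  // where this split starts
            running += e.getValue();           // advance past its records
        }
        return offsets;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("part-00001", 5L);
        counts.put("part-00000", 3L);
        counts.put("part-00002", 2L);
        System.out.println(startingOffsets(counts));
        // prints {part-00000=0, part-00001=3, part-00002=8}
    }
}
```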
>>>>>
>>>>> J
>>>>>
>>>>> On Thu, Apr 2, 2015 at 7:39 AM, André Pinto <andredasilvapinto@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to calculate the percentile ranks for the values of a
>>>>>> sorted PTable (i.e. at which % rank each element is within the whole
>>>>>> data set). Is there a way to do this with Crunch? It seems that we
>>>>>> would only need access to the global index of the record during an
>>>>>> iteration over the data set.
>>>>>>
>>>>>> Thanks in advance,
>>>>>> André
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
