crunch-user mailing list archives

From Vincent Fabro <vincent.fabro.nu...@gmail.com>
Subject Re: Access number of reducer tasks from Crunch
Date Mon, 04 May 2015 00:18:20 GMT
Ok, I missed Aggregate.top() (guess my research wasn't thorough).
I'll go with the framework's built-in function; it seems cleaner than using
the Context.

Thanks a lot for your answers!

Vincent

On Sun, May 3, 2015 at 8:11 AM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Vincent,
>
> Yeah, you can get at it. Each DoFn inherits a protected getContext()
> method, and the context it returns has getNumReduceTasks() defined on it,
> just like in the Nutch code you cited. We try (with varying degrees of success)
> to make the underlying MR framework as accessible as possible.
>
> J
>
> On Sun, May 3, 2015 at 2:16 AM, David Ortiz <dpo5003@gmail.com> wrote:
>
>> Do you actually care about the number of reducers, or do you just want the
>> top n from a table?  The latter is built into the framework.
>>
>> On Sat, May 2, 2015, 6:12 PM Vincent Fabro <vincent.fabro.nutch@gmail.com>
>> wrote:
>>
>>> Dear all
>>>
>>> Is it possible to access the number of reducer tasks from Crunch
>>> (something equivalent to context.getNumReduceTasks() in Hadoop)?
>>>
>>> Context: I'm porting Nutch to Crunch. One operation (in
>>> GeneratorJob.java, GeneratorMapper.java and GeneratorReducer.java -
>>> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java)
>>> takes the top n urls according to a score. If I understand correctly, "n/num of
>>> reduce tasks" urls are selected for each reduce task (GeneratorReducer,
>>> line 102). If there's a good shuffle, the result is good enough.
>>>
>>> Thanks in advance!
>>>
>>> Vincent
>>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
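
For readers comparing the two approaches discussed in this thread, here is a
minimal sketch in plain Java. It does not call Crunch or Hadoop: ordinary
lists stand in for the reducer partitions, and the names TopNSketch,
ScoredUrl, topNPerReducer, topNGlobal, and the sample data are all
hypothetical. It only illustrates why the per-reducer quota ("n / number of
reduce tasks") matches a global top n when the shuffle is even, and can
diverge when it is skewed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class TopNSketch {

    /** Hypothetical value type: a URL and its score. */
    static final class ScoredUrl {
        final String url;
        final double score;

        ScoredUrl(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Orders entries from highest score to lowest. */
    static final Comparator<ScoredUrl> BY_SCORE_DESC =
            Comparator.comparingDouble((ScoredUrl s) -> s.score).reversed();

    /**
     * Quota approach: each simulated reducer keeps its own best
     * (n / numReduceTasks) entries. Integer division truncates, so fewer
     * than n entries can come back when n is not a multiple of the
     * partition count.
     */
    static List<String> topNPerReducer(List<List<ScoredUrl>> partitions, int n) {
        List<String> out = new ArrayList<>();
        if (partitions.isEmpty()) {
            return out;
        }
        int quota = n / partitions.size();
        for (List<ScoredUrl> partition : partitions) {
            partition.stream()
                    .sorted(BY_SCORE_DESC)
                    .limit(quota)
                    .forEach(s -> out.add(s.url));
        }
        return out;
    }

    /** Global approach: the best n entries across all partitions. */
    static List<String> topNGlobal(List<List<ScoredUrl>> partitions, int n) {
        return partitions.stream()
                .flatMap(List::stream)
                .sorted(BY_SCORE_DESC)
                .limit(n)
                .map(s -> s.url)
                .collect(Collectors.toList());
    }

    /** Skewed sample data: all high scores land in the first partition. */
    static List<List<ScoredUrl>> sample() {
        return Arrays.asList(
                Arrays.asList(new ScoredUrl("a", 0.9),
                              new ScoredUrl("b", 0.8),
                              new ScoredUrl("c", 0.7)),
                Arrays.asList(new ScoredUrl("d", 0.2),
                              new ScoredUrl("e", 0.1)));
    }

    public static void main(String[] args) {
        System.out.println(topNPerReducer(sample(), 2)); // prints [a, d]
        System.out.println(topNGlobal(sample(), 2));     // prints [a, b]
    }
}
```

With the skewed sample, the quota approach returns "d" even though "b" has a
higher score, which is the case the "good shuffle" caveat in the original
question is guarding against.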
