crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Access number of reducer tasks from Crunch
Date Sun, 03 May 2015 06:11:31 GMT
Hey Vincent,

Yeah, you can get at it. Each DoFn inherits a protected getContext() method,
and the context it returns has getNumReduceTasks() defined on it, just as in
the Nutch code you cited. We try (with varying degrees of success) to make
the underlying MR framework as accessible as possible.
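
As a rough illustration of the per-reducer split described in the quoted
question ("n/num of reduce tasks"), here is a minimal, self-contained sketch.
The class and method names (ReducerQuota, perReducerLimit) are hypothetical,
and the zero-reducer fallback is an assumption of this sketch. Nothing here
is taken from the Nutch or Crunch sources. Inside a DoFn, the reducer count
would come from getContext().getNumReduceTasks().

```java
// Hypothetical helper: not part of Crunch or Nutch.
public final class ReducerQuota {
    private ReducerQuota() {}

    /**
     * Splits a global top-n budget evenly across reduce tasks.
     * In a Crunch DoFn, numReduceTasks would be obtained from
     * getContext().getNumReduceTasks().
     */
    public static long perReducerLimit(long topN, int numReduceTasks) {
        if (numReduceTasks <= 0) {
            // Assumption: with no reducers reported, keep the whole budget
            // in one task rather than dividing by zero.
            return topN;
        }
        // Integer division: any remainder is dropped, so the combined
        // output can be slightly below topN.
        return topN / numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(perReducerLimit(1000L, 4));
    }
}
```

Because the division truncates, the reducers together may emit fewer than n
records, and the result only approximates a true top n when keys are spread
evenly across reducers.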

J

On Sun, May 3, 2015 at 2:16 AM, David Ortiz <dpo5003@gmail.com> wrote:

> Do you actually care about the number of reducers, or do you just want the
> top n from a table?  The latter is built into the framework.
>
> On Sat, May 2, 2015, 6:12 PM Vincent Fabro <vincent.fabro.nutch@gmail.com>
> wrote:
>
>> Dear all
>>
>> Is it possible to access the number of reducer tasks from Crunch
>> (something equivalent to context.getNumReduceTasks() in Hadoop)?
>>
>> Context: I'm porting Nutch to Crunch. One operation (in
>> GeneratorJob.java, GeneratorMapper.java and GeneratorReducer.java -
>> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java)
>> takes the top n URLs according to a score. If I understand correctly, "n/num of
>> reduce tasks" urls are selected for each reduce task (GeneratorReducer,
>> line 102). If there's a good shuffle, the result is good enough.
>>
>> Thanks in advance!
>>
>> Vincent
>>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
