From: Vincent Fabro <vincent.fabro.nutch@gmail.com>
To: user@crunch.apache.org
Reply-To: user@crunch.apache.org
Date: Mon, 4 May 2015 02:18:20 +0200
Subject: Re: Access number of reducer tasks from Crunch
Mailing-List: contact user-help@crunch.apache.org; run by ezmlm
Content-Type: text/plain; charset=UTF-8

Ok, I missed Aggregate.top() (guess my research wasn't thorough).
I'll go with the framework's built-in function; it seems cleaner than using
Context.

Thanks a lot for your answers!

Vincent

On Sun, May 3, 2015 at 8:11 AM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Vincent,
>
> Yeah, you can get at it. Each DoFn inherits a protected getContext()
> method that has the getNumReduceTasks() method defined on it, just like it
> does in the Nutch code you cited. We try (with varying degrees of success)
> to make the underlying MR framework as accessible as possible.
>
> J
>
> On Sun, May 3, 2015 at 2:16 AM, David Ortiz <dpo5003@gmail.com> wrote:
>
>> Do you actually care about the number of reducers, or just get top n from
>> a table? The latter is built into the framework.
>>
>> On Sat, May 2, 2015, 6:12 PM Vincent Fabro <vincent.fabro.nutch@gmail.com>
>> wrote:
>>
>>> Dear all
>>>
>>> Is it possible to access the number of reducer tasks from Crunch
>>> (something equivalent to context.getNumReduceTasks() in Hadoop)?
>>>
>>> Context: I'm porting Nutch to Crunch. One operation (in
>>> GeneratorJob.java, GeneratorMapper.java and GeneratorReducer.java -
>>> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java)
>>> takes the n top urls according to a score. If I understand well, "n/num of
>>> reduce tasks" urls are selected for each reduce task (GeneratorReducer,
>>> line 102). If there's a good shuffle, the result is good enough.
>>>
>>> Thanks in advance!
>>>
>>> Vincent
>>>
>>
>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
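
For readers of the archive, here is a rough sketch of why the "n / number of reduce tasks" quota described in the thread is only "good enough" when the shuffle is even, while a global top-n does not depend on how records are partitioned. This is plain Java with no Crunch or Hadoop dependency: the class TopNQuota, its method names, the sample scores, and the use of integer division for the quota are all assumptions made for illustration, not code taken from Nutch or Crunch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration class; not part of Crunch, Hadoop, or Nutch.
final class TopNQuota {

    // Global top-n: sort every score in descending order and keep the first n.
    static List<Double> globalTop(List<Double> scores, int n) {
        List<Double> sorted = new ArrayList<>(scores);
        sorted.sort(Comparator.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    // Per-reducer quota: each partition (standing in for one reduce task)
    // keeps only its own top (n / number of partitions) scores.
    // Integer division is an assumption of this sketch.
    static List<Double> quotaTop(List<List<Double>> partitions, int n) {
        if (partitions.isEmpty()) {
            throw new IllegalArgumentException("need at least one partition");
        }
        int quota = n / partitions.size();
        List<Double> kept = new ArrayList<>();
        for (List<Double> partition : partitions) {
            kept.addAll(globalTop(partition, quota));
        }
        kept.sort(Comparator.reverseOrder());
        return kept;
    }

    public static void main(String[] args) {
        // Even shuffle: high scores are spread across both partitions.
        List<List<Double>> even = Arrays.asList(
                Arrays.asList(0.9, 0.5, 0.1),
                Arrays.asList(0.8, 0.6, 0.2));
        // Skewed shuffle: the three best scores land in one partition.
        List<List<Double>> skewed = Arrays.asList(
                Arrays.asList(0.9, 0.8, 0.6),
                Arrays.asList(0.5, 0.2, 0.1));

        System.out.println(quotaTop(even, 4));   // prints [0.9, 0.8, 0.6, 0.5]
        System.out.println(quotaTop(skewed, 4)); // prints [0.9, 0.8, 0.5, 0.2]
    }
}
```

With the even split, the quota result matches the true top 4. With the skewed split, 0.6 is dropped in favour of 0.2, which is the kind of error a built-in global top-n avoids.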