Mailing-List: contact user-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@crunch.apache.org
Received-SPF: pass (athena.apache.org: message received from 54.191.145.13
 which is an MX secondary for user@crunch.apache.org)
MIME-Version: 1.0
In-Reply-To: 
 <CAH29n6MXpm_cR2U7f39mUXb7rrEgBf1aJDcgo-Ur_2CvAdZzvQ@mail.gmail.com>
References: 
 <CAB9U8e6MOds9+H5FzbtxSOz=aj-8h5o8BzOFKEM=Mudjs_+54g@mail.gmail.com>
	<CAC79LcYq7Axw75cSxK0T3RyRr-f8y=7hYfxPTWKM3iAvOS5Xew@mail.gmail.com>
	<CAH29n6MXpm_cR2U7f39mUXb7rrEgBf1aJDcgo-Ur_2CvAdZzvQ@mail.gmail.com>
Date: Mon, 4 May 2015 02:18:20 +0200
Message-ID: 
 <CAB9U8e5MXdFs7U4Ww_b1uatYEafaz5XJEu+b7KnACZN2pmZPdA@mail.gmail.com>
Subject: Re: Access number of reducer tasks from Crunch
From: Vincent Fabro <vincent.fabro.nutch@gmail.com>
To: user@crunch.apache.org
Content-Type: multipart/alternative; boundary=047d7bf0bfc2cd8b780515367fe9

--047d7bf0bfc2cd8b780515367fe9
Content-Type: text/plain; charset=UTF-8

Ok, I missed Aggregate.top() (guess my research wasn't thorough).
I'll go with the framework's built-in function, which seems cleaner than
using Context.
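
For the archives, here is a minimal plain-Java sketch (no Crunch or Hadoop
dependency) of why the per-reducer quota I described only approximates a
global top n. The names TopNSketch, globalTop and quotaTop and the sample
scores are made up for illustration, and the integer division used for the
quota is my assumption, not a copy of the Nutch code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class TopNSketch {

    // Global selection: the n highest scores over the whole data set.
    static List<Double> globalTop(List<Double> scores, int n) {
        return scores.stream()
                .sorted(Comparator.reverseOrder())
                .limit(n)
                .collect(Collectors.toList());
    }

    // Quota selection: each partition (standing in for one reduce task)
    // keeps only its own top n / numPartitions scores.
    static List<Double> quotaTop(List<List<Double>> partitions, int n) {
        if (partitions.isEmpty()) {
            throw new IllegalArgumentException("need at least one partition");
        }
        int quota = n / partitions.size(); // assumption: integer division
        List<Double> selected = new ArrayList<>();
        for (List<Double> partition : partitions) {
            selected.addAll(globalTop(partition, quota));
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Double> all = Arrays.asList(0.9, 0.2, 0.8, 0.1);

        // Well-mixed shuffle: high scores are spread across partitions.
        List<List<Double>> mixed = Arrays.asList(
                Arrays.asList(0.9, 0.2), Arrays.asList(0.8, 0.1));

        // Skewed shuffle: both high scores land in the same partition.
        List<List<Double>> skewed = Arrays.asList(
                Arrays.asList(0.9, 0.8), Arrays.asList(0.2, 0.1));

        System.out.println("global: " + globalTop(all, 2));
        System.out.println("mixed:  " + quotaTop(mixed, 2));
        System.out.println("skewed: " + quotaTop(skewed, 2));
    }
}
```

With the well-mixed partitions the quota version returns the same scores as
the global one; with the skewed partitions it drops 0.8 and keeps 0.2
instead, which is the "good enough if the shuffle is good" caveat.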

Thanks a lot for your answers!

Vincent

On Sun, May 3, 2015 at 8:11 AM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Vincent,
>
> Yeah, you can get at it. Each DoFn inherits a protected getContext()
> method, and the context it returns has getNumReduceTasks() defined on it,
> just as in the Nutch code you cited. We try (with varying degrees of
> success) to make the underlying MR framework as accessible as possible.
>
> J
>
> On Sun, May 3, 2015 at 2:16 AM, David Ortiz <dpo5003@gmail.com> wrote:
>
>> Do you actually care about the number of reducers, or do you just want the
>> top n from a table?  The latter is built into the framework.
>>
>> On Sat, May 2, 2015, 6:12 PM Vincent Fabro <vincent.fabro.nutch@gmail.com>
>> wrote:
>>
>>> Dear all
>>>
>>> Is it possible to access the number of reducer tasks from Crunch
>>> (something equivalent to context.getNumReduceTasks() in Hadoop)?
>>>
>>> Context: I'm porting Nutch to Crunch. One operation (in
>>> GeneratorJob.java, GeneratorMapper.java and GeneratorReducer.java -
>>> https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/crawl/GeneratorReducer.java)
>>> takes the top n URLs according to a score. If I understand correctly,
>>> "n / number of reduce tasks" URLs are selected for each reduce task
>>> (GeneratorReducer, line 102). With a good shuffle, the result is good enough.
>>>
>>> Thanks in advance!
>>>
>>> Vincent
>>>
>>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

--047d7bf0bfc2cd8b780515367fe9--