Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAMUZbZbwbO8YsmAbi7jw3oRxveNV12r2fVtU1UtHA+QZMT8iKw@mail.gmail.com>
References: 
 <CAMUZbZbwbO8YsmAbi7jw3oRxveNV12r2fVtU1UtHA+QZMT8iKw@mail.gmail.com>
Date: Mon, 11 May 2015 13:26:49 -0400
Message-ID: 
 <CAEo-6+T+VypE6UwDtWqFu5HzxU-nvM7K0L3BkoMQa_hpoOFNMw@mail.gmail.com>
Subject: Re: Filtering by value in Reducer
From: Shahab Yunus <shahab.yunus@gmail.com>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Content-Type: multipart/alternative; boundary=001a113d3082db80b90515d1ae95

--001a113d3082db80b90515d1ae95
Content-Type: text/plain; charset=UTF-8

What is the type of the threshold variable? I believe sum is a Java int.

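One more thing that may be worth checking, on the assumption that your driver
is still the one from the WordCount v1.0 tutorial: that driver registers the
reducer class as the combiner as well (job.setCombinerClass(IntSumReducer.class)).
If that line is still there, the threshold test also runs on the map side
against partial sums, so a word whose partial count in one split is not above
the threshold is dropped before the real reducer ever sees it. Below is a
minimal plain-Java sketch of that effect. It does not use Hadoop at all;
CombinerFilterDemo, sumAndFilter, sumOnly and run are made-up names, and it
assumes the combiner runs exactly once per map task, which Hadoop does not
guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation (no Hadoop classes) of a threshold filter that is
// applied in the combiner as well as in the reducer. All names are hypothetical.
final class CombinerFilterDemo {

    // Mimics the posted reduce body: sum the values and emit the sum only
    // if it is greater than the threshold. Returns null when nothing is emitted.
    static Integer sumAndFilter(List<Integer> values, int threshold) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum > threshold ? Integer.valueOf(sum) : null;
    }

    // A combiner that only sums and never filters.
    static int sumOnly(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Sends one key through "one combiner call per map task" and then the reducer.
    // perMapValues holds the values each map task emitted for that key.
    static Integer run(List<List<Integer>> perMapValues, int threshold,
                       boolean filterInCombiner) {
        List<Integer> reducerInput = new ArrayList<>();
        for (List<Integer> mapOutput : perMapValues) {
            if (filterInCombiner) {
                Integer partial = sumAndFilter(mapOutput, threshold);
                if (partial != null) {
                    reducerInput.add(partial);
                }
            } else {
                reducerInput.add(sumOnly(mapOutput));
            }
        }
        if (reducerInput.isEmpty()) {
            return null; // the reducer is never called for this key
        }
        return sumAndFilter(reducerInput, threshold);
    }

    public static void main(String[] args) {
        // A word seen once in each of three input splits: the true total is 3.
        List<List<Integer>> perMap =
                Arrays.asList(Arrays.asList(1), Arrays.asList(1), Arrays.asList(1));
        System.out.println(run(perMap, 1, true));  // null: every partial sum of 1 was dropped
        System.out.println(run(perMap, 1, false)); // 3: filter applied to the full total only
        System.out.println(run(perMap, 0, true));  // 3: threshold below every value, nothing lost
    }
}
```

If this turns out to be the cause, removing the setCombinerClass call, or using
a separate combiner that only sums, should bring the missing pairs back.
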
Regards,
Shahab

On Mon, May 11, 2015 at 1:08 PM, Peter Ruch <rutschifengga@gmail.com> wrote:

> Hi,
>
> I am currently playing around with Hadoop and have some problems when
> trying to filter in the Reducer.
>
> I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with
> some additional functionality
> and added the possibility to filter by the specific value of each key -
> e.g. only output the key-value pairs where [[ value > threshold ]].
>
> Filtering Code in Reducer
> #####################################
>
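> int sum = 0;
> // Add up all counts received for this key.
> for (IntWritable val : values) {
>      sum += val.get();
> }
> // Emit the pair only if the total exceeds the threshold.
> if ( sum > threshold ) {
>      result.set(sum);
>      context.write(key, result);
> }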
>
> #####################################
>
> For a threshold smaller than any value, the above code works as expected
> and the output contains all key-value pairs.
> If I increase the threshold to 1, some pairs are missing from the output
> although their values are larger than the threshold.
>
> I tried to work out the error myself, but I could not get it to work as
> intended. I use the exact Tutorial setup with Oracle JDK 8
> on a CentOS 7 machine.
>
> As far as I understand, the respective Iterable<...> in the Reducer
> already contains all the observed values for a specific key.
> Why is it possible that I am missing some of these key-value pairs then?
> It only fails in very few cases. The input file is pretty large (250 MB),
> so I also tried increasing the memory for the map and reduce steps,
> but it did not help (I tried a lot of different things without success).
>
> Maybe someone already experienced similar problems / is more experienced
> than I am.
>
>
> Thank you,
>
> Peter
>

--001a113d3082db80b90515d1ae95--