Subject: Re: Re: Re: Re: Filtering by value in Reducer
From: Drake 민영근
To: user@hadoop.apache.org
Date: Wed, 13 May 2015 11:12:30 +0900

Hi,

Did you try MapReduce local mode with smaller input data? Also, writing a test case with MRUnit is very helpful for debugging.
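For reference, local mode is just a matter of setting mapreduce.framework.name to local, and a minimal MRUnit test for such a reducer could look roughly like the sketch below. The class name FilterSumReducer and the "threshold" configuration key are placeholders taken from the code later in this thread, not the actual job.

#####################################

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class FilterSumReducerTest {

    @Test
    public void keepsKeysWhoseSumExceedsThreshold() throws Exception {
        // FilterSumReducer stands in for the reducer under discussion
        ReduceDriver<Text, IntWritable, Text, IntWritable> driver =
                ReduceDriver.newReduceDriver(new FilterSumReducer());

        // the same configuration key the reducer reads in setup()
        driver.getConfiguration().setInt("threshold", 1);

        // two 1s sum to 2, which is > 1, so the pair must appear in the output
        driver.withInput(new Text("hadoop"),
                         Arrays.asList(new IntWritable(1), new IntWritable(1)))
              .withOutput(new Text("hadoop"), new IntWritable(2))
              .runTest();
    }
}

#####################################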

Thanks.

Drake 민영근 Ph.D
kt NexR

On Tue, May 12, 2015 at 11:23 PM, Peter Ruch <rutschifengga@gmail.com> wrote:
Hi,

No, I did not create any custom logs; I was only looking through the "standard" logs.
I just started out with Hadoop and did not think of explicitly logging that part of the code,
as I thought I was simply missing a small detail that someone of you might spot.

But I will definitely look into the custom logging and post my findings.

@ Shahab and Drake: Thank you very much for your help.


Best,
Peter



On 12.05.2015 14:57, Shahab Yunus wrote:

Have you tried explicitly printing or logging in your reducer around the code that compares and then outputs the values? Maybe that will give you a clue as to what is happening. Debug the threshold value that you get in the reducer and check whether it is what you actually set (in the case where you set it to greater than -1).

You can also try to use the compare method for comparing the IntWritables, though I doubt that would make any difference.
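To illustrate both suggestions, the block around the comparison could be instrumented roughly like this (a sketch only; LOG is assumed to be a Commons Logging logger declared on the reducer class):

#####################################

// inside reduce(), after summing the values for the current key
LOG.info("key=" + key + " sum=" + sum + " threshold=" + threshold);

// the same comparison expressed via IntWritable.compareTo instead of raw ints
IntWritable sumWritable = new IntWritable(sum);
if (sumWritable.compareTo(new IntWritable(threshold)) > 0) {
    result.set(sum);
    context.write(key, result);
}

#####################################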

Shahab

On May 12, 2015 8:17 AM, "Peter Ruch" <rutschifengga@gmail.com> wrote:
Hi,

I already skimmed through the logs but I could not find anything special.

I am just really confused why I am having this problem.

If the Iterable<...> for a specific key contains all of the observed values - and it seems to do so,
otherwise the program wouldn't work correctly in the standard case with [[ threshold = -1 ]] -
it should also work when I only write the key-value pairs that satisfy the condition [[ sum > threshold ]] to the output file.

Did I miss something? Maybe I have to handle these cases in a specific way, but I did not find anything about that online.


Thank you for your help,

Peter



On 12.05.2015 12:35, Drake 민영근 wrote:
Hi, Peter

The missing records, are they just gone without any logs? How about your reduce task logs?

Thanks

Drake 민영근 Ph.D
kt NexR

On Tue, May 12, 2015 at 5:18 AM, Peter Ruch <rutschifengga@gmail.com> wrote:
Hello,

sum and threshold are both Integers.
For the threshold variable I first add a new resource to the configuration - conf.addResource( ... );

Later I get the threshold value from the configuration.

Code
#####################################

private int threshold;

public void setup( Context context ) {

        Configuration conf = context.getConfiguration();
        threshold = conf.getInt( "threshold", -1 );

}

#####################################
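In case it helps, the usual alternatives to conf.addResource( ... ) are to set the value on the driver side before the Job is created, or to pass -Dthreshold=1 on the command line and let ToolRunner/GenericOptionsParser pick it up. A rough driver-side sketch follows; the resource name, job name, and FilterDriver class are made up for illustration:

#####################################

Configuration conf = new Configuration();
// either load the value from an extra XML resource on the classpath ...
conf.addResource("filter-config.xml");
// ... or set it programmatically, before the Job copies the Configuration
conf.setInt("threshold", 1);

Job job = Job.getInstance(conf, "word count with threshold");
job.setJarByClass(FilterDriver.class);
// mapper/reducer/output settings as in the WordCount tutorial ...

#####################################

Either way, it is worth double-checking inside the task that getInt( "threshold", -1 ) really returns the configured value and not the -1 default.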


Best,
Peter



On 11.05.2015 19:26, Shahab Yunus wrote:
What is the type of the threshold variable? sum I believe is a Java int.

Regards,
Shahab

On Mon, May 11, 2015 at 1:08 PM, Peter Ruch <rutschifengga@gmail.com> wrote:
Hi,

I am currently playing around with Hadoop and have some problems when trying to filter in the Reducer.

I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with some additional functionality
and added the possibility to filter by the specific value of each key - e.g. only output the key-value pairs where [[ value > threshold ]].

Filtering Code in Reducer
#####################################

for (IntWritable val : values) {
    sum += val.get();
}
if ( sum > threshold ) {
    result.set(sum);
    context.write(key, result);
}

#####################################
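For comparison, a complete version of such a reducer might look like the sketch below; the class name and imports are assumed, and the body simply mirrors the tutorial's IntSumReducer plus the threshold filter. One caveat with this pattern: if the same class is also registered as the combiner, as the WordCount tutorial does with job.setCombinerClass( ... ), the filter is applied to partial sums on the map side as well, which can make pairs vanish from the final output.

#####################################

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FilterSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();
    private int threshold;

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        threshold = conf.getInt("threshold", -1);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;  // reset for every key
        for (IntWritable val : values) {
            sum += val.get();
        }
        // only emit keys whose total exceeds the threshold;
        // do not reuse this class as a combiner, or partial sums get filtered too
        if (sum > threshold) {
            result.set(sum);
            context.write(key, result);
        }
    }
}

#####################################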

For a threshold smaller than any value, the above code works as expected and the output contains all key-value pairs.
If I increase the threshold to 1, some pairs are missing from the output although their respective values are larger than the threshold.

I tried to work out the error myself, but I could not get it to work as intended. I use the exact Tutorial setup with Oracle JDK 8
on a CentOS 7 machine.

As far as I understand the respective Iterable<...> in the Reducer already contains all the observed values for a specific key.
Why is it possible that I am missing some of these key-value pairs then? It only fails in very few cases. The input file is pretty large - 250 MB -
so I also tried to increase the memory for the mapping and reduction steps, but it did not help (I tried a lot of different stuff without success).

Maybe someone already experienced similar problems / is more experienced than I am.


Thank you,

Peter





