Subject: Re: Re: Re: Re: Filtering by value in Reducer
From: Drake 민영근
To: user@hadoop.apache.org
Date: Wed, 13 May 2015 11:12:30 +0900

Hi,

Did you try MapReduce local mode with smaller input data? Also, writing a test case with MRUnit is very helpful for debugging.
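For reference, local mode is just a matter of setting mapreduce.framework.name to local, and a minimal MRUnit test for such a reducer could look roughly like the sketch below. The class name FilterSumReducer and the "threshold" configuration key are placeholders taken from the code later in this thread, not the actual job.

#####################################

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class FilterSumReducerTest {

    @Test
    public void keepsKeysWhoseSumExceedsThreshold() throws Exception {
        // FilterSumReducer stands in for the reducer under discussion
        ReduceDriver<Text, IntWritable, Text, IntWritable> driver =
                ReduceDriver.newReduceDriver(new FilterSumReducer());

        // the same configuration key the reducer reads in setup()
        driver.getConfiguration().setInt("threshold", 1);

        // two 1s sum to 2, which is > 1, so the pair must appear in the output
        driver.withInput(new Text("hadoop"),
                         Arrays.asList(new IntWritable(1), new IntWritable(1)))
              .withOutput(new Text("hadoop"), new IntWritable(2))
              .runTest();
    }
}

#####################################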

Thanks.

Drake 민영근 Ph.D
kt NexR

On Tue, May 12, 2015 at 11:23 PM, Peter Ruch <rutschifengga@gmail.com> wrote:
Hi,

No, I did not create any custom logs; I was only looking through the "standard" logs.
I just started out with Hadoop and did not think of explicitly logging that part of the code,
as I thought I was simply missing a small detail that someone of you might spot.

But I will definitely look into the custom logging and post my findings.

@ Shahab and Drake: Thank you very much for your help.


Best,
Peter



On 12.05.2015 14:57, Shahab Yunus wrote:

Have you tried explicitly printing or logging in your reducer around the code that compares and then outputs the values? Maybe that will give you a clue as to what is happening. Debug the threshold value that you get in the reducer and check whether it is what you actually set (in the case where you set it to greater than -1).

You can also try to use the compare method for comparing the IntWritables, though I doubt that would make any difference.
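To illustrate both suggestions, the block around the comparison could be instrumented roughly like this (a sketch only; LOG is assumed to be a Commons Logging logger declared on the reducer class):

#####################################

// inside reduce(), after summing the values for the current key
LOG.info("key=" + key + " sum=" + sum + " threshold=" + threshold);

// the same comparison expressed via IntWritable.compareTo instead of raw ints
IntWritable sumWritable = new IntWritable(sum);
if (sumWritable.compareTo(new IntWritable(threshold)) > 0) {
    result.set(sum);
    context.write(key, result);
}

#####################################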

Shahab

On May 12, 2015 8:17 AM, "Peter Ruch" <rutschifengga@gmail.com> wrote:
Hi,

I already skimmed through the logs but I could not find anything special.

I am just really confused why I am having this problem.

If the Iterable<...> for a specific key contains all of the observed values - and it seems to do so,
otherwise the program wouldn't work correctly in the standard case with [[ threshold = -1 ]] -
it should also work when I only write the key-value pairs that satisfy the condition [[ sum > threshold ]] to the output file.

Did I miss something? Maybe I have to handle these cases in a specific way, but I did not find anything about that online.


Thank you for your help,

Peter



On 12.05.2015 12:35, Drake 민영근 wrote:
Hi, Peter

The missing records, are they just gone without any logs? How about your reduce task logs?

Thanks

Drake 민영근 Ph.D
kt NexR

On Tue, May 12, 2015 at 5:18 AM, Peter Ruch <rutschifengga@gmail.com> wrote:
Hello,

sum and threshold are both Integers.
For the threshold variable I first add a new resource to the configuration - conf.addResource( ... );

Later I get the threshold value from the configuration.

Code
#####################################

private int threshold;

public void setup( Context context ) {

        Configuration conf = context.getConfiguration();
        threshold = conf.getInt( "threshold", -1 );

}

#####################################
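In case it helps, the usual alternatives to conf.addResource( ... ) are to set the value on the driver side before the Job is created, or to pass -Dthreshold=1 on the command line and let ToolRunner/GenericOptionsParser pick it up. A rough driver-side sketch follows; the resource name, job name, and FilterDriver class are made up for illustration:

#####################################

Configuration conf = new Configuration();
// either load the value from an extra XML resource on the classpath ...
conf.addResource("filter-config.xml");
// ... or set it programmatically, before the Job copies the Configuration
conf.setInt("threshold", 1);

Job job = Job.getInstance(conf, "word count with threshold");
job.setJarByClass(FilterDriver.class);
// mapper/reducer/output settings as in the WordCount tutorial ...

#####################################

Either way, it is worth double-checking inside the task that getInt( "threshold", -1 ) really returns the configured value and not the -1 default.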


Best,
Peter



On 11.05.2015 19:26, Shahab Yunus wrote:
What is the type of the threshold variable? sum I believe is a Java int.

Regards,
Shahab

On Mon, May 11, 2015 at 1:08 PM, Peter Ruch <rutschifengga@gmail.com> wrote:
Hi,

I am currently playing around with Hadoop and have some problems when trying to filter in the Reducer.

I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with some additional functionality
and added the possibility to filter by the specific value of each key - e.g. only output the key-value pairs where [[ value > threshold ]].

Filtering Code in Reducer
#####################################

for (IntWritable val : values) {
    sum += val.get();
}
if ( sum > threshold ) {
    result.set(sum);
    context.write(key, result);
}

#####################################
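For comparison, a complete version of such a reducer might look like the sketch below; the class name and imports are assumed, and the body simply mirrors the tutorial's IntSumReducer plus the threshold filter. One caveat with this pattern: if the same class is also registered as the combiner, as the WordCount tutorial does with job.setCombinerClass( ... ), the filter is applied to partial sums on the map side as well, which can make pairs vanish from the final output.

#####################################

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FilterSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();
    private int threshold;

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        threshold = conf.getInt("threshold", -1);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;  // reset for every key
        for (IntWritable val : values) {
            sum += val.get();
        }
        // only emit keys whose total exceeds the threshold;
        // do not reuse this class as a combiner, or partial sums get filtered too
        if (sum > threshold) {
            result.set(sum);
            context.write(key, result);
        }
    }
}

#####################################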

For a threshold smaller than any value, the above code works as expected and the output contains all key-value pairs.
If I increase the threshold to 1, some pairs are missing from the output although their respective values are larger than the threshold.

I tried to work out the error myself, but I could not get it to work as intended. I use the exact Tutorial setup with Oracle JDK 8
on a CentOS 7 machine.

As far as I understand the respective Iterable<...> in the Reducer already contains all the observed values for a specific key.
Why is it possible that I am missing some of these key-value pairs then? It only fails in very few cases. The input file is pretty large - 250 MB -
so I also tried to increase the memory for the mapping and reduction steps, but it did not help (I tried a lot of different stuff without success).

Maybe someone already experienced similar problems / is more experienced than I am.


Thank you,

Peter





