Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Reply-To: user@hadoop.apache.org
Date: Mon, 11 May 2015 13:26:49 -0400
Subject: Re: Filtering by value in Reducer
From: Shahab Yunus
To: "user@hadoop.apache.org"

What is the type of the threshold variable? sum, I believe, is a Java int.

Regards,
Shahab

On Mon, May 11, 2015 at 1:08 PM, Peter Ruch <rutschifengga@gmail.com> wrote:

> Hi,
>
> I am currently playing around with Hadoop and have some problems when
> trying to filter in the Reducer.
>
> I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with
> some additional functionality and added the possibility to filter by the
> specific value of each key, e.g. only output the key-value pairs where
> [[ value > threshold ]].
>
> Filtering Code in Reducer
> #####################################
>
> for (IntWritable val : values) {
>     sum += val.get();
> }
> if ( sum > threshold ) {
>     result.set(sum);
>     context.write(key, result);
> }
>
> #####################################
>
> For a threshold smaller than any value, the above code works as expected
> and the output contains all key-value pairs.
> If I increase the threshold to 1, some pairs are missing in the output
> although the respective value would be larger than the threshold.
>
> I tried to work out the error myself, but I could not get it to work as
> intended. I use the exact Tutorial setup with Oracle JDK 8
> on a CentOS 7 machine.
>
> As far as I understand, the respective Iterable<...> in the Reducer
> already contains all the observed values for a specific key.
> Why is it possible that I am missing some of these key-value pairs then?
> It only fails in very few cases. The input file is pretty large (250 MB),
> so I also tried to increase the memory for the mapping and reduction steps,
> but it did not help (I tried a lot of different stuff without success).
>
> Maybe someone has already experienced similar problems or is more
> experienced than I am.
>
> Thank you,
>
> Peter
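To make the type question concrete, here is a minimal standalone sketch of the sum-then-filter step from the quoted reducer code, written in plain Java without any Hadoop dependencies. The class name ThresholdFilter, the method filterBySum, and the sample counts are hypothetical and not taken from the original job. The sketch assumes that threshold is an int, like sum, and that each key's list already holds all of its values.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ThresholdFilter {

    // Sums the values of each key and keeps only the keys whose
    // total is strictly greater than the threshold (hypothetical helper).
    static Map<String, Integer> filterBySum(Map<String, List<Integer>> grouped, int threshold) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int val : entry.getValue()) {
                sum += val;
            }
            // Both operands are int, so this is a plain numeric comparison.
            if (sum > threshold) {
                kept.put(entry.getKey(), sum);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data: one list of counts per key.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("hadoop", Arrays.asList(1, 1, 1));
        grouped.put("reducer", Arrays.asList(1));

        // With threshold 1, only "hadoop" (sum 3) passes the filter.
        System.out.println(filterBySum(grouped, 1)); // prints {hadoop=3}
    }
}
```

The comparison is strict, so a key whose sum equals the threshold is dropped, which matches the `sum > threshold` condition in the quoted code.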