Message-ID: <4DF24A7B.5060803@uchicago.edu>
Date: Fri, 10 Jun 2011 11:46:51 -0500
From: Shi Yu
To: common-user@hadoop.apache.org
Subject: Re: Automatic line number in reducer output

Yes, it works perfectly.
I actually didn't realize that the combiner and the reducer could use different classes. In that case the job would have a three-layer architecture, which I think would be interesting and useful.

Shi

On 6/10/2011 9:31 AM, Robert Evans wrote:
> In this case you probably want two different classes. You can have the base Reducer class that adds in the line count, and then subclass it for the combiner, which sets a flag to not output the line numbers.
>
> --Bobby
>
>
> On 6/9/11 12:57 PM, "Shi Yu" wrote:
>
> Hi,
>
> Thanks for the reply. The line count in the new API works fine now; it was a
> bug in my code. In the new API,
>
> Iterator is changed to Iterable,
>
> but I didn't pay attention to that and was still using Iterator with its hasNext() and next() methods. Surprisingly, the wrong code still ran and produced output, but the line number count did not work; I think it was a null value. After fixing the Iterable mistake, the code works fine.
>
> The remaining problem is that when a combiner and a reducer are both implemented, the output looks like
>
> 00000 00000 value1
> 00001 00000 value2
> 00002 00000 value3
> 00003 00001 value4
> 00004 00001 value5
>
> The first column holds the counts from the reducer, the second column the counts from the combiner. I want to avoid the line counter in the combiner, so my plan is to create another class that is almost the same as the Reducer, but without the line count. I think it is doable to set the Combiner and the Reducer to different classes in the jobconf, but I haven't tried it yet.
>
> Best,
>
> Shi
>
>
> On 6/9/2011 8:49 AM, Robert Evans wrote:
>
>> What exactly is linecount being output as in the new APIs?
>>
>> --Bobby
>>
>> On 6/7/11 11:21 AM, "Shi Yu" wrote:
>>
>> Hi,
>>
>> I am wondering whether there is any built-in function to automatically add a
>> self-incrementing line number to the reducer output (like an auto-increment
>> key in a relational DB).
>>
>> I have this problem because in the 0.19.2 API, I used a variable linecount
>> that increases in the reducer, like:
>>
>> public static class Reduce extends MapReduceBase implements
>>         Reducer{
>>     private long linecount = 0;
>>
>>     public void reduce(Text key, Iterator values,
>>             OutputCollector output, Reporter reporter)
>>             throws IOException {
>>
>>         //.....some code here
>>         linecount++;
>>         output.collect(new Text(Long.toString(linecount)), var);
>>     }
>> }
>>
>>
>> However, I found that this does not work in the 0.20.2 API if I write the
>> code like:
>>
>> public static class Reduce extends
>>         org.apache.hadoop.mapreduce.Reducer{
>>     private long linecount = 0;
>>
>>     public void reduce(Text key, Iterator values,
>>             org.apache.hadoop.mapreduce.Reducer.Context context)
>>             throws IOException, InterruptedException {
>>
>>         //some code here
>>         linecount++;
>>         context.write(new Text(Long.toString(linecount)), var);
>>     }
>> }
>>
>> It no longer seems to work.
>>
>>
>> I would also like to know, when both a combiner and a reducer are
>> implemented, how to avoid the line number being written twice (because I
>> only want it in the reducer, not in the combiner). Thanks!
>>
>>
>> Shi
>>
>

--
Postdoctoral Scholar
Institute for Genomics and Systems Biology
Department of Medicine, the University of Chicago
Knapp Center for Biomedical Discovery
900 E. 57th St. Room 10148
Chicago, IL 60637, US
Tel: 773-702-6799
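
The two-class arrangement Bobby suggests in this thread can be sketched roughly as below. This is only a Hadoop-free model in plain Java so that it runs on its own: the class names (NumberingReducer, PlainCombiner, LineNumberDemo), the emitLineNumbers flag, and the use of String and List in place of Text and Context are all assumptions made for illustration, not code from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the base Reducer from the thread. In a real job
// this would extend org.apache.hadoop.mapreduce.Reducer and write through
// its Context instead of appending to a List.
class NumberingReducer {
    private long lineCount = 0;               // running counter, one per instance
    protected boolean emitLineNumbers = true; // subclasses may switch this off

    // The values parameter is an Iterable, as in the new API's reduce method.
    public void reduce(String key, Iterable<String> values, List<String> out) {
        for (String value : values) {
            if (emitLineNumbers) {
                // Zero-padded to five digits, like the sample output in the thread.
                out.add(String.format("%05d\t%s", lineCount, value));
                lineCount++;
            } else {
                out.add(value);               // pass the value through unchanged
            }
        }
    }
}

// Hypothetical combiner: same logic as the reducer, but with numbering disabled.
class PlainCombiner extends NumberingReducer {
    PlainCombiner() {
        emitLineNumbers = false;
    }
}

class LineNumberDemo {
    // Runs the combiner and then the reducer over two key groups.
    static List<String> run() {
        PlainCombiner combiner = new PlainCombiner();
        NumberingReducer reducer = new NumberingReducer();
        List<String> out = new ArrayList<>();

        List<String> groupA = new ArrayList<>();
        combiner.reduce("a", Arrays.asList("value1", "value2"), groupA);
        reducer.reduce("a", groupA, out);

        List<String> groupB = new ArrayList<>();
        combiner.reduce("b", Arrays.asList("value3"), groupB);
        reducer.reduce("b", groupB, out);
        return out;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In an actual job the two classes would be registered separately, with Job.setCombinerClass for the subclass and Job.setReducerClass for the base class. It is probably also worth putting @Override on the real reduce method: with the new API, a reduce that takes an Iterator is only an overload, so the framework keeps calling the inherited reduce, which would explain why the earlier code ran but never numbered anything. Note that the counter lives in each reducer instance, so with more than one reduce task the numbers are only unique within each output file.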