Date: Tue, 8 Mar 2011 05:29:47 -0700
Subject: Re: How to count rows of output files?
From: James Seigel
To: common-user@hadoop.apache.org

In the simplest case, if you only need the total number of lines across
A, B, and C, look at the "Reduce output records" value in the output the
job normally generates. As the others are telling you, this is a
counter: you can read it and print it explicitly in your code, or just
read it by eye in the job summary once the job is done.

Cheers
James.

On Tue, Mar 8, 2011 at 3:29 AM, Harsh J wrote:
> I think the previous reply wasn't very accurate. So you need a count
> per-file? One way I can think of doing that, via the job itself, is to
> use Counter to count the "name of the output + the task's ID". But it
> would not be a good solution if there are several hundreds of tasks.
>
> A distributed count can be performed on a single file, however, using
> an identity mapper + null output and then looking at map-input-records
> counter after completion.
>
> On Tue, Mar 8, 2011 at 3:54 PM, Harsh J wrote:
>> Count them as you sink using the Counters functionality of Hadoop
>> Map/Reduce (If you're using MultipleOutputs, it has a way to enable
>> counters for each name used). You can then aggregate related counters
>> post-job, if needed.
>>
>> On Tue, Mar 8, 2011 at 3:11 PM, Jun Young Kim wrote:
>>> Hi.
>>>
>>> my hadoop application generated several output files by a single job.
>>> (for example, A, B, C are generated as a result)
>>>
>>> after finishing a job, I want to count each files' row counts.
>>>
>>> is there any way to count each files?
>>>
>>> thanks.
>>>
>>> --
>>> Junyoung Kim (juneng603@gmail.com)
>>
>> --
>> Harsh J
>> www.harshj.com
>
> --
> Harsh J
> www.harshj.com
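
To make the "aggregate related counters post-job" idea from the thread a bit more concrete, here is a minimal sketch in plain Java. It does not call Hadoop at all: the class name, the method name, and the "<output name>_<task id>" counter naming are assumptions made up for illustration. In a real job the name/value pairs would come from the job's counters rather than from a hand-built map.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: sums per-task counters by output name.
class OutputRowCounts {

    // Assumes counter names look like "<outputName>_<taskId>", e.g. "A_task_0001".
    // That convention is invented for this sketch; adapt the split to your own naming.
    static Map<String, Long> aggregateByOutputName(Map<String, Long> perTaskCounters) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> entry : perTaskCounters.entrySet()) {
            String name = entry.getKey();
            int sep = name.indexOf('_');
            // Names without a separator are treated as already being an output name.
            String outputName = sep < 0 ? name : name.substring(0, sep);
            totals.merge(outputName, entry.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for counters read after the job finishes.
        Map<String, Long> counters = new TreeMap<>();
        counters.put("A_task_0001", 120L);
        counters.put("A_task_0002", 80L);
        counters.put("B_task_0001", 45L);
        counters.put("C_task_0002", 300L);

        Map<String, Long> totals = aggregateByOutputName(counters);
        totals.forEach((name, rows) -> System.out.println(name + "\t" + rows));

        // Sum over all outputs, i.e. the single total James describes.
        long all = totals.values().stream().mapToLong(Long::longValue).sum();
        System.out.println("total\t" + all);
    }
}
```

As Harsh notes, one counter per output per task does not scale well to several hundred tasks, so a sketch like this only makes sense for small jobs. If only the overall total is needed, the built-in "Reduce output records" counter is enough.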