From: Ted Dunning
Date: Fri, 17 Dec 2010 09:30:27 -0800
Subject: Re: InputFormat for a big file
To: common-user@hadoop.apache.org

a) This is a small file by Hadoop standards. You should be able to process it by conventional methods on a single machine in about the same time it takes to start a Hadoop job that does nothing at all.

b) Reading a single line at a time is not as inefficient as you might think. If you write a mapper that reads each line, converts it to an integer, and outputs a key consisting of a constant integer together with the number you read, the mapper will process the data reasonably quickly. If you then add a combiner and a reducer that sum up the numbers in each list, the amount of data spilled will be nearly zero. A sketch of this approach follows below the quoted message.

On Fri, Dec 17, 2010 at 7:58 AM, madhu phatak wrote:

> Hi,
> I have a very large file of size 1.4 GB. Each line of the file is a number.
> I want to find the sum of all those numbers.
> I wanted to use NLineInputFormat as the InputFormat, but it sends only one line
> to the mapper, which is very inefficient.
> So can you guide me on writing an InputFormat that splits the file
> into multiple splits, so that each mapper can read multiple
> lines from its split?
>
> Regards
> Madhukar
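Since no code was posted in the thread, here is a minimal sketch of the mapper/combiner/reducer approach described in (b), written against the org.apache.hadoop.mapreduce API of roughly this era. The class names (SumNumbers, SumMapper, SumReducer) and the driver wiring are illustrative assumptions, not anything from the original messages; the default TextInputFormat already divides the file into block-sized splits, each feeding many lines to one mapper, so no custom InputFormat is needed.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SumNumbers {

      // Every line is parsed as a long and emitted under one constant key,
      // so all partial sums funnel into a single reduce group.
      public static class SumMapper
          extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        private static final IntWritable CONSTANT_KEY = new IntWritable(1);
        private final LongWritable number = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String s = line.toString().trim();
          if (!s.isEmpty()) {
            // Assumes every non-empty line is a valid integer, as in the
            // original question; a malformed line would throw here.
            number.set(Long.parseLong(s));
            context.write(CONSTANT_KEY, number);
          }
        }
      }

      // Used as both combiner and reducer: sums are folded map-side first,
      // which is why almost nothing gets spilled or shuffled.
      public static class SumReducer
          extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<LongWritable> values,
            Context context) throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable v : values) {
            sum += v.get();
          }
          context.write(key, new LongWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "sum numbers");
        job.setJarByClass(SumNumbers.class);
        job.setInputFormatClass(TextInputFormat.class); // default splits, many lines per mapper
        job.setMapperClass(SumMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Assuming it is packaged into a jar, this would run as something like "hadoop jar sum.jar SumNumbers /input/numbers.txt /output/sum" (paths hypothetical). Registering SumReducer as the combiner is the key point of (b): each map task emits one partial sum per split rather than one record per line, so the single reducer only adds up a handful of values.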