Date: Thu, 31 Dec 2009 12:21:06 -0800
Subject: Re: How to ensure LzoTextInputFormat is used to generate input splits for .lzo files
From: Steve Kuo
To: common-user@hadoop.apache.org

Digging around the new Job API with a rested brain, I came up with

    job.setInputFormatClass(LzoTextInputFormat.class);

which solved the problem.

On Thu, Dec 31, 2009 at 9:53 AM, Steve Kuo wrote:

> I have followed
> http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/ and
> http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ to build the
> requisite hadoop-lzo jar and native .so files. (The jar and .so files were
> built from Kevin Weil's git repository. Thanks Kevin.) I have configured
> core-site.xml and mapred-site.xml as instructed to enable lzo for both map
> and reduce output. The creation of lzo index also worked. The last step was
> to replace TextInputFormat with LzoTextInputFormat. As I only have
>
> FileInputFormat.addInputPath(jobConf, new Path(inputPath));
>
> it was replaced with
>
> LzoTextInputFormat.addInputPath(job, new Path(inputPath));
>
> When I ran my MR job, I noticed that the new code was able to read in .lzo
> input files and decompressed fine. The output was also lzo compressed.
> However, only one map job was created for each input .lzo file indicating
> that input splitting was not done by LzoTextInputFormat but more likely by
> its parent such as FileInputFormat.
> There must be a way to ensure
> LzoTextInputFormat is used in the Map task. How can this be done?
>
> Thanks in advance.
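
For anyone finding this thread later, here is a toy sketch of why swapping the class name in front of addInputPath probably had no effect while setInputFormatClass did. It assumes addInputPath is a static helper that LzoTextInputFormat inherits from FileInputFormat and that only records the input path, and that a job with no input format set falls back to a default text format. Every class below (ToyJob, ToyFileInputFormat, ToyLzoTextInputFormat, InputFormatDemo), the configuration keys, and the path are hypothetical stand-ins for illustration, not Hadoop or hadoop-lzo code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a job configuration: plain string keys and values.
class ToyJob {
    final Map<String, String> conf = new HashMap<>();

    // Assumption: when no input format is set, a default text format is used.
    String inputFormat() {
        return conf.getOrDefault("input.format", "ToyTextInputFormat");
    }
}

// Hypothetical stand-in for FileInputFormat: the static helper only records a path.
class ToyFileInputFormat {
    static void addInputPath(ToyJob job, String path) {
        job.conf.merge("input.dir", path, (a, b) -> a + "," + b);
    }
}

// Hypothetical stand-in for LzoTextInputFormat; it inherits the static helper unchanged.
class ToyLzoTextInputFormat extends ToyFileInputFormat { }

class InputFormatDemo {
    // Mirrors the first attempt: only the class name before the dot changed.
    // Java resolves this to ToyFileInputFormat.addInputPath, so nothing else is recorded.
    static ToyJob firstAttempt() {
        ToyJob job = new ToyJob();
        ToyLzoTextInputFormat.addInputPath(job, "/data/logs.lzo");
        return job;
    }

    // Mirrors the fix: the input format is selected explicitly on the job.
    static ToyJob withExplicitFormat() {
        ToyJob job = firstAttempt();
        job.conf.put("input.format", "ToyLzoTextInputFormat");
        return job;
    }

    public static void main(String[] args) {
        System.out.println(firstAttempt().inputFormat());
        System.out.println(withExplicitFormat().inputFormat());
    }
}
```

In this model the first attempt still reports the default format, because a static method called through a subclass name is the same method. If the real classes behave the same way, that would explain the one-map-per-file result: the default text format was still in charge and did not split the compressed files.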