From: R P <hadooper@outlook.com>
To: user@hadoop.apache.org
Subject: RE: CombineFileInputFormat with Gzip files
Date: Fri, 25 Sep 2015 17:52:09 -0700

That approach creates temp files on HDFS; see the code below.
Thanks for your response, though. I wrote my own record reader, which passes the file splits to LineRecordReader, and that works for my problem.

    public CompressedCombineFileRecordReader(CombineFileSplit split,
            TaskAttemptContext context, Integer index) throws IOException {

        Configuration currentConf = context.getConfiguration();
        this.path = split.getPath(index);
        boolean isCompressed = findCodec(currentConf, path);
        // Compressed input is first decompressed to a temp file on HDFS (dPath).
        if (isCompressed)
            codecWiseDecompress(context.getConfiguration());

        fs = this.path.getFileSystem(currentConf);

        this.startOffset = split.getOffset(index);

        if (isCompressed) {
            this.end = startOffset + rlength;
        } else {
            this.end = startOffset + split.getLength(index);
            dPath = path;
        }

        boolean skipFirstLine = false;

        fileIn = fs.open(dPath);

        // The temp file is only marked for deletion when the FileSystem is closed.
        if (isCompressed) fs.deleteOnExit(dPath);

        // Not at the start of the file: back up one byte and discard the partial first line.
        if (startOffset != 0) {
            skipFirstLine = true;
            --startOffset;
            fileIn.seek(startOffset);
        }
        reader = new LineReader(fileIn);
        if (skipFirstLine) {
            startOffset += reader.readLine(new Text(), 0,
                    (int) Math.min((long) Integer.MAX_VALUE, end - startOffset));
        }
        this.pos = startOffset;
    }
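
For comparison, here is a minimal sketch of reading small gzip files as a stream, without first writing a decompressed copy anywhere. It uses only java.util.zip and java.nio, not Hadoop classes, so it is not the record reader described above; the names GzipLineSketch, readLines, readAll and writeGzip are hypothetical and exist only for this illustration. In a Hadoop record reader the analogous step would presumably be wrapping the opened file in the stream returned by CompressionCodec.createInputStream. Because gzip cannot be decoded from an arbitrary offset, each compressed file is read whole, from byte 0.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for a per-file reader inside a combined split.
class GzipLineSketch {

    // Reads one file line by line, decompressing on the fly when the name ends in ".gz".
    static List<String> readLines(Path file) throws IOException {
        boolean gz = file.getFileName().toString().endsWith(".gz");
        List<String> lines = new ArrayList<>();
        try (InputStream raw = Files.newInputStream(file);
             // A gzip stream has to be decoded from its first byte, so no seek is done here.
             InputStream in = gz ? new GZIPInputStream(raw) : raw;
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Reads several small files in order, the way one combined split feeds one mapper.
    static List<String> readAll(List<Path> files) throws IOException {
        List<String> all = new ArrayList<>();
        for (Path f : files) {
            all.addAll(readLines(f));
        }
        return all;
    }

    // Helper for the demo only: writes text to a gzip-compressed file.
    static void writeGzip(Path file, String text) throws IOException {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("gzip-sketch");
        Path a = dir.resolve("part-0.gz");
        Path b = dir.resolve("part-1.gz");
        writeGzip(a, "alpha\nbeta\n");
        writeGzip(b, "gamma\n");
        for (String line : readAll(Arrays.asList(a, b))) {
            System.out.println(line);
        }
    }
}
```

Running main writes two small gzip files to a temp directory and prints their lines in file order. Nothing decompressed is ever written back to disk, which is the side effect the temp-file approach above cannot avoid.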

Date: Thu, 24 Sep 2015 14:38:45 +0530
Subject: Re: CombineFileInputFormat with Gzip files
From: mathursharp@gmail.com
To: user@hadoop.apache.org

What sort of side effects?

On Thu, Sep 24, 2015 at 2:35 PM, R P <hadooper@outlook.com> wrote:

Thanks Harshit. That approach doesn't look good, as it will write uncompressed data to HDFS, resulting in job side effects.
-R P

Date: Thu, 24 Sep 2015 09:55:49 +0530
Subject: Re: CombineFileInputFormat with Gzip files
From: mathursharp@gmail.com
To: user@hadoop.apache.org
CC: mapreduce-user@hadoop.apache.org

Hi R P,
Follow this link:
http://www.ibm.com/developerworks/library/bd-hadoopcombine/

Regards,
Harshit

On Thu, Sep 24, 2015 at 4:46 AM, R P <hadooper@outlook.com> wrote:

Hello All,
What is the best way to process small Gzip files with CombineFileInputFormat? If possible, please provide a link to the documentation. I appreciate your help.
Thanks,
*Adding mapreduce-dev to the mailing list.

From: hadooper@outlook.com
To: user@hadoop.apache.org
Subject: CombineFileInputFormat with Gzip files
Date: Tue, 22 Sep 2015 18:29:05 -0700

Hello All,
What is the best way to use CombineFileInputFormat with Gzip files as input?
Thanks,

--
Harshit Mathur