hadoop-common-dev mailing list archives

From Milind Bhandarkar <mili...@yahoo-inc.com>
Subject Re: [proposal] RAgzip: multiple map tasks for a large gzipped file
Date Thu, 06 Nov 2008 16:19:17 GMT
Daehyun,

Is there a RAgzip output format that produces the .ap (access points) file
while writing out the gzipped file on HDFS? That would eliminate the
preprocessing stage for gzipped files.
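
To make the question concrete, here is a minimal Python sketch of recording
access points at write time. It is not RAgzip's method: instead of indexing an
existing gzip file with inflatePrime, it swaps in a simpler technique, forcing
a Z_FULL_FLUSH after each chunk so the stream is byte-aligned and free of
history at each recorded offset. It also assumes a raw deflate stream rather
than a full gzip container, and the names write_with_access_points and
read_partition are hypothetical.

```python
import zlib


def write_with_access_points(chunks):
    """Compress chunks into one raw-deflate stream and record where each starts.

    Assumption: a full flush after every chunk is acceptable. It costs some
    compression ratio, since the history window is reset at each access point.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw deflate
    out = bytearray()
    access_points = []
    for chunk in chunks:
        access_points.append(len(out))        # byte offset where this chunk begins
        out += comp.compress(chunk)
        out += comp.flush(zlib.Z_FULL_FLUSH)  # byte-align and reset the history
    out += comp.flush(zlib.Z_FINISH)          # terminate the stream
    return bytes(out), access_points


def read_partition(data, access_points, i):
    """Decompress only partition i, without touching the preceding bytes."""
    start = access_points[i]
    end = access_points[i + 1] if i + 1 < len(access_points) else len(data)
    decomp = zlib.decompressobj(-15)
    return decomp.decompress(data[start:end]) + decomp.flush()
```

With something like this, the list of offsets plays the role of the .ap file
and is available as soon as the writer closes the stream.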

Aside from that, I think it will be a great addition to Hadoop.

- milind


On 11/5/08 5:06 AM, "김대현[로그모델링]" <daehyun.kim@nhncorp.com> wrote:

> Hello,
> 
> I'm new to this mailing list, and this is my first attempt at a contribution.
> 
> 
> 
> We have made a patch that enables multiple map tasks for one large *gzipped*
> file. We call the patch RAgzip, short for Random Access gzip. It is similar to
> HADOOP-3646, which supports large bzip2 files, and is an alternative to the
> approach of PIG-42, which requires re-compression.
> 
> 
> 
> RAgzip uses zlib's inflatePrime function, which makes random access into a
> gzipped file possible. Because inflatePrime is available only from zlib
> 1.2.2.4 onward, RAgzip requires zlib 1.2.2.4 or higher. (We tested with zlib
> 1.2.3.)
> 
> 
> 
> RAgzip requires a preprocessing step that creates an access point (.ap)
> file, which serves as an index of the chunks of the gzipped file.
> (Unfortunately, the preprocessing step appears to be inherently sequential;
> we have not found a way to parallelize it.)
> 
> 
> 
> RAgzip splits the gzipped file using the .ap file. More specifically, when a
> map task starts, RAgzip reads the .ap file, gets the start position and the
> compression information for one partition of the gzipped file, decompresses
> that partition, and feeds it to the map task as input.
> 
> 
> 
> In short, you can use RAgzip just by changing the InputFormat to
> RAGZIPInputFormat.
> 
> 
> 
> We have packaged RAgzip in two forms:
> 
> 1. jar
>   - does not touch the Hadoop core
>   - solves the zlib version conflict problem by statically linking zlib 1.2.3
> 
> 2. hadoop patch
>   - integrated into the Hadoop core
>   - patches ZlibDecompressor.{c,java}, so libhadoop.so changes
>   - the version of zlib on the system should be 1.2.2.4 or higher
> 
> 
> 
> What I want to ask is:
> 
> How should we contribute RAgzip to Hadoop? May I just submit the Hadoop patch
> (package 2) to JIRA?
> 
> I have read http://wiki.apache.org/hadoop/HowToContribute and changed our
> source code to conform to the coding style.
> 
> 
> 
> Any comments will be appreciated.
> 
> Thank you.
> 
> 
> 
> - Daehyun Kim
> 
> 
> 
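
For readers following the thread, the splitting step described above can be
sketched roughly as below. The message does not describe the layout of the .ap
file, so this assumes it reduces to a sorted list of compressed byte offsets;
the per-entry compression information (presumably what inflatePrime needs in
order to resume decoding) is left out. The name plan_splits and its parameters
are hypothetical.

```python
def plan_splits(offsets, file_length, target_bytes):
    """Group consecutive access points into splits of at least target_bytes.

    offsets      -- sorted compressed byte offsets of the access points (assumed)
    file_length  -- total size of the gzipped file in bytes
    target_bytes -- desired minimum compressed size of one map task's input
    Returns a list of (start, end) byte ranges, one per map task.
    """
    splits = []
    start = offsets[0]
    for off in offsets[1:]:
        # Only cut at an access point, since decoding can resume only there.
        if off - start >= target_bytes:
            splits.append((start, off))
            start = off
    splits.append((start, file_length))  # the last split runs to end of file
    return splits


if __name__ == "__main__":
    for split in plan_splits([0, 100, 250, 400, 700], 1000, 300):
        print(split)
```

Each map task would then seek to its start offset and decompress up to its end
offset, so split boundaries always coincide with access points.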


-- 
Milind Bhandarkar
Y!IM: GridSolutions
408-349-2136 
(milindb@yahoo-inc.com)

