hadoop-common-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: [proposal] RAgzip: multiple map tasks for a large gzipped file
Date Wed, 05 Nov 2008 13:40:33 GMT
Hi, and welcome. Thanks for your contribution :)

Here are a few comments:

1) We can't distribute any GPL or LGPL products with Hadoop. AFAIK,
zlib is under a pure GPL license. Does zlib need to be in the
lib folder?
2) Yes, you can create a Jira issue for this. If you attach your
patch and submit it, it will be reviewed by active committers.

Yours,
Edward

2008/11/5 김대현[로그모델링] <daehyun.kim@nhncorp.com>:
> Hello,
>
> I'm new to this mailing list, and this is my first attempt at contributing.
>
>
>
> We have made a patch that enables multiple map tasks for one large *gzipped* file. We
> call the patch RAgzip, short for Random Access gzip. It is similar to HADOOP-3646,
> which supports large bzip2 files, and is an alternative to PIG-42, which requires re-compression.
>
>
>
> RAgzip uses zlib's inflatePrime function, which supports random access on a gzipped file.
> Since inflatePrime is available from version 1.2.2.4 onward, RAgzip requires zlib 1.2.2.4
> or higher. (We tested with zlib 1.2.3.)
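
The access-point idea can be illustrated with a minimal sketch. This is not the RAgzip code. It uses Python's zlib binding, which does not expose inflatePrime, so it assumes the access point falls on a byte boundary (forced here with Z_SYNC_FLUSH) and supplies the preceding 32 KiB of uncompressed data as a dictionary. The function names are hypothetical.

```python
import zlib

WINDOW = 32 * 1024  # deflate back-references reach at most 32 KiB


def compress_with_access_point(first: bytes, second: bytes):
    """Raw-deflate two chunks; return (stream, offset, window).

    offset is the byte position in the compressed stream where the second
    chunk begins, and window is the uncompressed data preceding it that a
    decompressor needs to resolve back-references.
    """
    comp = zlib.compressobj(level=6, wbits=-15)  # wbits=-15 selects raw deflate
    # Z_SYNC_FLUSH ends the current block on a byte boundary, so no
    # bit-level priming (inflatePrime) is needed at this point.
    head = comp.compress(first) + comp.flush(zlib.Z_SYNC_FLUSH)
    tail = comp.compress(second) + comp.flush()
    return head + tail, len(head), first[-WINDOW:]


def read_from_access_point(stream: bytes, offset: int, window: bytes) -> bytes:
    """Decompress from offset to the end without reading earlier bytes."""
    decomp = zlib.decompressobj(wbits=-15, zdict=window)
    return decomp.decompress(stream[offset:])


first = b"alpha beta gamma\n" * 4000
second = b"gamma beta alpha\n" * 4000
stream, offset, window = compress_with_access_point(first, second)
restored = read_from_access_point(stream, offset, window)
```

In an arbitrary gzip file, block boundaries usually fall in the middle of a byte, which is the case inflatePrime handles in the C API by feeding the leftover bits before decompression resumes.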
>
>
>
> RAgzip requires a preprocessing step that creates an access point (.ap) file, which
> acts as an index of the gzipped file's chunks. (Unfortunately, the preprocessing step seems
> to be inherently sequential; we have not found a way to parallelize it.)
>
>
>
> RAgzip splits the gzipped file using the .ap file. More specifically, when a map task starts,
> RAgzip reads the .ap file, gets the start position and the compression information of a
> partition of the gzipped file, decompresses the partition, and feeds it to the map task as input.
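
The splitting step might look roughly like the sketch below. The message does not describe the .ap file layout, so the AccessPoint record (a compressed offset paired with an uncompressed offset), the Split record, and plan_splits are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessPoint:
    compressed_offset: int    # where decompression can resume in the .gz file
    uncompressed_offset: int  # position of that data in the original file


@dataclass(frozen=True)
class Split:
    start: AccessPoint
    length: int  # uncompressed bytes this map task should consume


def plan_splits(points: List[AccessPoint], total_uncompressed: int,
                target: int) -> List[Split]:
    """Group consecutive access points into splits of roughly `target` bytes."""
    if not points:
        return []
    splits = []
    start = points[0]
    for point in points[1:]:
        covered = point.uncompressed_offset - start.uncompressed_offset
        # Close the current split at the first access point that lies at
        # least `target` uncompressed bytes past its start.
        if covered >= target:
            splits.append(Split(start, covered))
            start = point
    # The last split runs to the end of the file.
    splits.append(Split(start, total_uncompressed - start.uncompressed_offset))
    return splits


points = [AccessPoint(10, 0), AccessPoint(400, 1000),
          AccessPoint(900, 2000), AccessPoint(1300, 3000)]
splits = plan_splits(points, total_uncompressed=3500, target=2000)
```

Each split starts at an access point, so a map task can seek to its compressed offset and decompress only its own partition.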
>
>
>
> In short, you may use RAgzip by just changing InputFormat to RAGZIPInputFormat.
>
>
>
> We have packaged RAgzip in two forms:
>
> 1. jar
>
> - does not touch the Hadoop core
>
> - solves the zlib version conflict problem by statically linking zlib 1.2.3
>
> 2. hadoop patch
>
> - integrated into the Hadoop core
>
> - patches ZlibDecompressor.{c,java}, so libhadoop.so changes
>
> - requires zlib 1.2.2.4 or higher on the system
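
The version requirement for the second package could be checked along the following lines. This is a sketch in Python, not part of the patch; the helper names are hypothetical, and the 1.2.2.4 threshold is taken from the message above.

```python
import re
import zlib

REQUIRED = (1, 2, 2, 4)  # minimum zlib version stated in the message


def parse_zlib_version(text: str) -> tuple:
    """Turn '1.2.3' or '1.2.2.4-foo' into a 4-part tuple such as (1, 2, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized zlib version: {text!r}")
    parts = [int(p) for p in match.group().split(".")][:4]
    # Pad short versions with zeros so tuples compare component by component.
    return tuple(parts + [0] * (4 - len(parts)))


def supports_inflate_prime(version: str) -> bool:
    return parse_zlib_version(version) >= REQUIRED


# Check the zlib library that the running interpreter is linked against.
runtime_ok = supports_inflate_prime(zlib.ZLIB_RUNTIME_VERSION)
```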
>
>
>
> What I want to ask is:
>
> How can we contribute RAgzip to Hadoop? May I just submit the Hadoop patch (package 2) to
> JIRA?
>
> I have read http://wiki.apache.org/hadoop/HowToContribute and changed our source code
> to follow the coding style.
>
>
>
> Any comments will be appreciated.
>
> Thank you.
>
>
>
> - Daehyun Kim
>
>
>
>



-- 
Best Regards, Edward J. Yoon @ NHN, corp.
edwardyoon@apache.org
http://blog.udanax.org