hadoop-common-dev mailing list archives

From Daehyun Kim [Log Modeling] <daehyun....@nhncorp.com>
Subject RE: [proposal] RAgzip: multiple map tasks for a large gzipped file
Date Fri, 07 Nov 2008 09:29:07 GMT
Hello, Milind Bhandarkar.

Thank you for your comment.

> Is there a RAgzip output format that produces the .ap (access points) file
> while writing out the gzipped file on HDFS?
Currently, no.

I have considered creating the .ap file while writing out the gzipped file.
Doing so requires an understanding of gzip's compression process, but at the
moment I don't know the details of zlib; I have only used a few of its
functions. To produce access points during compression, I will have to study
the gzip (and zlib) source. Anyway, I'll try it.
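One way to produce access points at write time without tracking bit-level
inflate state (an alternative to the inflatePrime approach, in the style of
dictzip/BGZF-like formats, and not how RAgzip itself works) is to force a
byte-aligned Z_FULL_FLUSH every N input bytes and record the offsets. A
minimal Python sketch under that assumption (the function name and the
access-point layout are hypothetical, not the actual .ap format):

```python
import zlib

def gzip_with_ap(data, interval=1 << 16):
    # Compress `data` as a single gzip member (wbits=31 adds the gzip
    # header/trailer) and force a Z_FULL_FLUSH every `interval` input
    # bytes.  A full flush byte-aligns the stream and resets the 32 KB
    # window, so decompression can restart at each recorded point
    # without needing inflatePrime.  Returns the gzip bytes plus a list
    # of (uncompressed_offset, compressed_offset) access points.
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)
    out, aps = bytearray(), []
    for off in range(0, len(data), interval):
        out += comp.compress(data[off:off + interval])
        out += comp.flush(zlib.Z_FULL_FLUSH)
        aps.append((min(off + interval, len(data)), len(out)))
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), aps
```

Restarting at `aps[k]` is then just `zlib.decompressobj(-15)` (raw deflate)
applied to the bytes from that compressed offset onward. The cost is slightly
worse compression, since no back-reference may cross a flush point.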

Most of the data files we have received were already gzipped.
So, instead of a compression utility or a class that implements OutputFormat,
we developed a preprocessing utility that scans an existing gzip file and
creates the .ap file. In fact, the preprocessing stage does not take very
long: it took about 70 seconds for a 1.7 GB gzipped file in our test.
Once the .ap file is created, it can be reused for subsequent jobs.
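The scanning pass can be sketched roughly as follows. This is only an
illustration, not the RAgzip utility: Python's zlib does not expose
inflatePrime or the inflate bit position, so this sketch records
byte-granularity offsets only, whereas a real access-point entry must also
store the bit offset into the current byte and the preceding 32 KB window
(what zlib's inflatePrime and inflateSetDictionary restore). All names here
are mine, not RAgzip's:

```python
import zlib

SPAN = 1 << 20  # hypothetical spacing between access points

def scan_access_points(gz, span=SPAN):
    # Sequentially inflate an existing gzip stream, noting roughly every
    # `span` uncompressed bytes how much compressed input has been fed.
    # The compressed offsets are coarse (chunk granularity); they only
    # demonstrate why the scan is inherently sequential -- the inflate
    # state at each point depends on everything before it.
    d = zlib.decompressobj(47)          # 32+15: auto-detect gzip header
    points, produced, next_mark = [], 0, span
    for pos in range(0, len(gz), 16384):
        produced += len(d.decompress(gz[pos:pos + 16384]))
        if produced >= next_mark:
            points.append((produced, pos + 16384))
            next_mark = produced + span
    return points, produced
```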

Thanks.

- Daehyun

-----Original Message-----
From: Milind Bhandarkar [mailto:milindb@yahoo-inc.com] 
Sent: Friday, November 07, 2008 1:19 AM
To: core-dev@hadoop.apache.org
Cc: 정주원 [Log Modeling]
Subject: Re: [proposal] RAgzip: multiple map tasks for a large gzipped file

Daehyun,

Is there a RAgzip output format that produces the .ap (access points) file
while writing out the gzipped file on HDFS? That would eliminate the
preprocessing stage for gzipped files.

Aside from that, I think it will be a great addition to Hadoop.

- milind


On 11/5/08 5:06 AM, "Daehyun Kim [Log Modeling]" <daehyun.kim@nhncorp.com> wrote:

> Hello,
> 
> I'm new to this mailing list, and this is my first attempt at a contribution.
> 
> 
> 
> We have made a patch that enables multiple map tasks for one large *gzipped*
> file. We call the patch RAgzip, an abbreviation of Random Access gzip. It is
> similar to HADOOP-3646, which supports large bzip2 files, and is an
> alternative to the approach of PIG-42, which requires re-compression.
> 
> 
> 
> RAgzip uses zlib's inflatePrime function, which makes random access into a
> gzipped file possible. Since inflatePrime has been available only since zlib
> 1.2.2.4, RAgzip requires zlib 1.2.2.4 or higher. (We tested with zlib 1.2.3.)
> 
> 
> 
> RAgzip requires a preprocessing step that creates an access point (.ap)
> file, which serves as an index of the gzipped file's chunks. (Unfortunately,
> this preprocessing step appears to be inherently sequential; we could not
> find a way to parallelize it.)
> 
> 
> 
> RAgzip splits the gzipped file using the .ap file. To be more specific, when
> a map task starts, RAgzip reads the .ap file, gets the start position and
> compression state of a partition of the gzipped file, decompresses that
> partition, and feeds it to the map task as its input.
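The split-and-decompress step just described can be illustrated with a small
sketch. It is not the RAgzip reader: it assumes access points that fall on
byte-aligned Z_FULL_FLUSH boundaries, whereas RAgzip's inflatePrime approach
handles arbitrary bit offsets; the field names are my guesses, not the actual
.ap format:

```python
import zlib

def read_partition(gz, access_point, length):
    # access_point is a (uncompressed_offset, compressed_offset) pair as
    # it might appear in an access-point file.  Assumes the gzip stream
    # was written with a byte-aligned Z_FULL_FLUSH at that point, so a
    # raw-deflate decompressor can restart there with no prior window.
    u_off, c_off = access_point
    d = zlib.decompressobj(-15)          # raw deflate, no gzip header
    # Feed the rest of the file but cap the output at `length` bytes --
    # this is the partition that would be handed to one map task.
    return d.decompress(gz[c_off:], length)
```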
> 
> 
> 
> In short, you can use RAgzip simply by changing the InputFormat to
> RAGZIPInputFormat.
> 
> 
> 
> We have made RAgzip in two package types, as follows:
> 
> 1. jar
>   - does not touch the Hadoop core
>   - solves the zlib version conflict problem by statically linking zlib 1.2.3
> 
> 2. hadoop patch
>   - integrated into the Hadoop core
>   - patches ZlibDecompressor.{c,java} (libhadoop.so changes)
>   - the version of zlib on the system must be 1.2.2.4 or higher
> 
> 
> 
> What I would like to ask is: how should we contribute RAgzip to Hadoop? May
> I simply submit the hadoop patch (package 2) to JIRA?
> 
> I have read http://wiki.apache.org/hadoop/HowToContribute and changed our
> source code to conform to the coding style.
> 
> 
> 
> Any comments will be appreciated.
> 
> Thank you.
> 
> 
> 
> - Daehyun Kim
> 
> 
> 


-- 
Milind Bhandarkar
Y!IM: GridSolutions
408-349-2136 
(milindb@yahoo-inc.com)



