hadoop-common-dev mailing list archives

From "Edward J. Yoon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4652) RAgzip: multiple map tasks for a large gzipped file
Date Tue, 18 Nov 2008 08:55:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648530#action_12648530 ]

Edward J. Yoon commented on HADOOP-4652:
----------------------------------------

{quote}
Does the Hudson system use the "-Dcompile.native=true" option?
Does the Hudson system have zlib 1.2.2.4 or higher?
{quote}

Nope, not yet. I don't see a native compile option in the build script, but the issue of
the native compile option and some libraries has been filed as HADOOP-3020.

Also, the newly attached patch looks good to me. +1
FYI, you can just add new patches to the end of the attachment list instead of deleting the old ones.


> RAgzip: multiple map tasks for a large gzipped file
> ---------------------------------------------------
>
>                 Key: HADOOP-4652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4652
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred, native
>    Affects Versions: 0.20.0
>            Reporter: Daehyun Kim
>         Attachments: HADOOP-4652.path
>
>
> Currently, Hadoop processes a gzipped file with only one map task.
> We have made a patch that enables multiple map tasks for one large *gzipped* file. We call the patch RAgzip.
> To process a gzipped file with multiple map tasks, you can use RAgzip by just changing the InputFormat to RAGZIPInputFormat.
> The options used by RAGZIPInputFormat are described in the javadoc of RAGZIPInputFormat.
> RAgzip uses zlib's inflatePrime function which supports random access on a gzipped file.

> Since inflatePrime is supported from zlib 1.2.2.4 onward, RAgzip requires zlib 1.2.2.4 or higher. (We tested with zlib 1.2.3.)
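
The version requirement above can be checked mechanically. The following is a minimal Python sketch, not part of the patch: `parse_version` and `supports_inflate_prime` are hypothetical helper names, and the 1.2.2.4 threshold is taken from the description above. It only compares version strings; it does not probe the native library that Hadoop actually loads.

```python
import re
import zlib

# Minimum zlib version providing inflatePrime, per the description above.
MIN_ZLIB = (1, 2, 2, 4)


def parse_version(text):
    """Return the leading dotted-numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric version in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def supports_inflate_prime(version_text):
    """True if the given zlib version string is 1.2.2.4 or higher."""
    # Tuples compare element by element, so (1, 2, 3) >= (1, 2, 2, 4)
    # and (1, 2, 2) < (1, 2, 2, 4).
    return parse_version(version_text) >= MIN_ZLIB


if __name__ == "__main__":
    # ZLIB_RUNTIME_VERSION is the zlib version loaded by this Python interpreter,
    # which is not necessarily the one the Hadoop native code links against.
    runtime = zlib.ZLIB_RUNTIME_VERSION
    print(runtime, supports_inflate_prime(runtime))
```

Running a check like this on each node would indicate which of the two patches described below applies, assuming the reported version matches the library Hadoop uses.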
> RAgzip requires a preprocessing step that creates an access point (.ap) file, which acts as an index of the gzipped file's chunks.
> The access point (.ap) file is located in the same directory as the gzipped file.
> For example, for "/user/hadoop/test.gz", the .ap file is created as "/user/hadoop/test.gz.ap".
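
The naming rule above amounts to appending ".ap" to the full path of the gzipped file. A minimal Python sketch, where `access_point_path` is a hypothetical helper name and not something the patch defines:

```python
def access_point_path(gz_path):
    """Return the access point file path: the gzipped file's path with '.ap' appended."""
    return gz_path + ".ap"


if __name__ == "__main__":
    print(access_point_path("/user/hadoop/test.gz"))
```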
> We made two patches.
> 1. One makes changes in the source of the Hadoop core. This is the main patch. If the zlib version on every machine in the Hadoop cluster is 1.2.2.4 or higher, you should use this patch.
> 2. On the other hand, if any machine in the Hadoop cluster has a zlib version lower than 1.2.2.4, you should use the other patch. This patch links zlib statically. So if you compile it once on a machine with zlib 1.2.2.4 or higher, RAgzip can be used in the Hadoop cluster even if some machines in the cluster have a zlib version lower than 1.2.2.4.
> As a result of its installation, the second patch creates a jar file (build/contrib/ragzip/hadoop-x.xx.x-dev-ragzip.jar).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

