hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4652) RAgzip: multiple map tasks for a large gzipped file
Date Wed, 01 Apr 2009 01:29:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694351#action_12694351 ]

Chris Douglas commented on HADOOP-4652:

This will be an excellent addition. Just a few questions/comments/nits:

* RAGZIPException doesn't seem necessary. In the current patch its only special use is commented
out in AccessPointController. Would IOException be sufficient?
* TestRAGZIPInputFormat crashed using an old version of zlib:
    [junit] Running org.apache.hadoop.mapred.TestRAGZIPInputFormat
    [junit] #
    [junit] # An unexpected error has been detected by Java Runtime Environment:
    [junit] #
    [junit] #  SIGSEGV (0xb) at pc=0x0000002a95f01d4d, pid=2055, tid=1076017504
    [junit] #
    [junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (11.0-b15 mixed mode linux-amd64)
    [junit] # Problematic frame:
    [junit] # V  [libjvm.so+0x33cd4d]
    [junit] #
    [junit] # An error report file with more information is saved as:
    [junit] # /snip/hadoop/hs_err_pid2055.log
    [junit] #
    [junit] # If you would like to submit a bug report, please visit:
    [junit] #   http://java.sun.com/webapps/bugreport/crash.jsp
    [junit] #
    [junit] Running org.apache.hadoop.mapred.TestRAGZIPInputFormat
    [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
    [junit] Test org.apache.hadoop.mapred.TestRAGZIPInputFormat FAILED (crashed)
It would be better if this could read the zlib version and throw an exception if it attempts
to use unsupported features (assuming this is the cause of the crash).
* The RAGZIP\* classes (and classes added to lib) probably belong in a new package, mapred.lib.zip
or something similar.
* Much of the mapred package is deprecated, so it's worth considering how this might be
written using the classes in the o.a.h.mapreduce package (HADOOP-1230).
Not as part of this patch, of course, but in the future.
* HADOOP-5406 seems to only affect the Compressor, and this only uses ZlibDecompressor::setDictionary;
it doesn't affect this patch, right?
* AccessPointController::existMetaFileOfSameSpanSize seems to have no callers and is probably
too tolerant of exceptions. Even after reading the javadoc, I'm still unsure of its purpose.
* Why do the TestRAGZIPInputFormat tests ignore NullPointerException?
* ZlibDecompressor:426 should call {{init(AUTODETECT_GZIP_ZLIB)}} instead of {{init(47)}}
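
To make two of the points above concrete, here is a minimal sketch in Java. It is not taken from the patch: the class name ZlibSupport and its methods are hypothetical, the version string is assumed to come from the native layer (e.g. zlib's zlibVersion()), and the minimum version is left as a parameter. The constant follows zlib's documented convention that adding 32 to windowBits in inflateInit2 enables automatic zlib/gzip header detection, which is where the literal 47 (32 + 15) comes from.

```java
import java.io.IOException;

/** Hypothetical helper; names are illustrative and not part of the patch. */
final class ZlibSupport {
    /** zlib's maximum window size, expressed in bits. */
    static final int MAX_WBITS = 15;

    /** windowBits + 32 asks inflateInit2 to auto-detect zlib or gzip headers. */
    static final int AUTODETECT_GZIP_ZLIB = 32 + MAX_WBITS;

    private ZlibSupport() {}

    /** Parses a dotted version; assumes every component is purely numeric. */
    private static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        int[] nums = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            nums[i] = Integer.parseInt(parts[i]);
        }
        return nums;
    }

    /** Compares component by component; missing components count as 0. */
    static int compareVersions(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int xi = i < x.length ? x[i] : 0;
            int yi = i < y.length ? y[i] : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Fails with an IOException instead of letting native code crash. */
    static void requireAtLeast(String actual, String minimum) throws IOException {
        if (compareVersions(actual, minimum) < 0) {
            throw new IOException("zlib " + actual + " is older than the required "
                + minimum + "; random access on gzip input is unavailable");
        }
    }
}
```

Calling requireAtLeast once, before the first native call that depends on the newer zlib feature, would turn the SIGSEGV into an ordinary test failure, assuming the old zlib is in fact the cause.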

> RAgzip: multiple map tasks for a large gzipped file
> ---------------------------------------------------
>                 Key: HADOOP-4652
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4652
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io, mapred, native
>    Affects Versions: 0.18.3, 0.19.0
>            Reporter: Daehyun Kim
>            Assignee: Daehyun Kim
>            Priority: Minor
>         Attachments: HADOOP-4652-v2.patch, HADOOP-4652-v3.patch, HADOOP-4652.path
> Currently, Hadoop processes a gzipped file with only one map task.
> We have made a patch that enables multiple map tasks for one large gzipped file. We call the patch RAgzip.
> To run multiple map tasks over a gzipped file, change the InputFormat to RAGZIPInputFormat.
> The options used by RAGZIPInputFormat are described in the javadoc of RAGZIPInputFormat.
> RAgzip uses zlib's inflatePrime function, which supports random access on a gzipped file.

> Since the inflatePrime is supported from the version of, it requires zlib
or higher. (We tested on zlib 1.2.3)
> RAgzip requires a preprocessing step that creates an access point (.ap) file, which acts as an index of the gzipped file's chunks.
> The access point (.ap) file is located in the same directory as the gzipped file.
> For example, for "/user/hadoop/test.gz" the .ap file is created as "/user/hadoop/test.gz.ap".
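
The naming convention described above can be sketched as follows. This is only an illustration under the assumption that the .ap name is always the gzipped file's full path plus the ".ap" suffix; the class and method names are hypothetical and do not come from the patch.

```java
/** Hypothetical helper illustrating the .ap naming rule; not part of the patch. */
final class AccessPointPaths {
    /** Suffix appended to the gzipped file's path (assumed from the description). */
    static final String AP_SUFFIX = ".ap";

    private AccessPointPaths() {}

    /** Returns the access point file path that sits next to the given gzipped file. */
    static String accessPointPathFor(String gzipPath) {
        return gzipPath + AP_SUFFIX;
    }
}
```

With this rule, accessPointPathFor("/user/hadoop/test.gz") yields "/user/hadoop/test.gz.ap", matching the example in the description.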

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
