hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1824) want InputFormat for zip files
Date Mon, 03 Nov 2008 11:18:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644691#action_12644691 ]

Steve Loughran commented on HADOOP-1824:
----------------------------------------

The most tested and stable Apache-licensed Java unzip code is in Ant's codebase. You can either
fork that code or try to get your changes back into Ant; with suitable tests, I am sure they
will be happily accepted.

> want InputFormat for zip files
> ------------------------------
>
>                 Key: HADOOP-1824
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1824
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.2
>            Reporter: Doug Cutting
>         Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files.  Thus one might pack many small
> files into large, compressed archives.  But, for efficient map-reduce operation, it is desirable
> to be able to split inputs into smaller chunks, with one or more small original files per split.
> The zip format, unlike tar, permits enumeration of files in the archive without scanning
> the entire archive.  Thus a zip InputFormat could efficiently permit splitting large archives
> into splits that contain one or more archived files.
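
To illustrate the enumeration point in the description, here is a minimal sketch of how archived files might be grouped into splits without decompressing anything. It is not the attached patch and not Ant's unzip code: it uses the JDK's java.util.zip.ZipFile, which reads the entry list from the archive's central directory, and it assumes a local file rather than an HDFS stream. The class name ZipSplitSketch, the method planSplits, and the targetBytes parameter are hypothetical names chosen for this example.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical sketch: plan splits from a zip's entry listing. */
public class ZipSplitSketch {

    /**
     * Groups the archived file names into splits. A new split is started
     * when adding the next entry would push the running compressed size
     * past targetBytes. Every split contains at least one entry.
     */
    public static List<List<String>> planSplits(File archive, long targetBytes)
            throws IOException {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;

        // ZipFile lists entries from the central directory, so no entry
        // data is decompressed while enumerating.
        try (ZipFile zip = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directories carry no data to process
                }
                // getCompressedSize() returns -1 when unknown; treat as 0.
                long size = Math.max(entry.getCompressedSize(), 0L);
                if (!current.isEmpty() && currentBytes + size > targetBytes) {
                    splits.add(current);
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(entry.getName());
                currentBytes += size;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

A real InputFormat would additionally need random access to the archive on HDFS and a record reader that opens each named entry; both are outside the scope of this sketch.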

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

