hadoop-mapreduce-issues mailing list archives

From "Sebastien Crocquevieille (Jira)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-210) want InputFormat for zip files
Date Thu, 03 Jun 2021 05:57:00 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17356189#comment-17356189 ]

Sebastien Crocquevieille commented on MAPREDUCE-210:
----------------------------------------------------

[~indrajeetapache], [~cutting] quick ping here.

Any chance of waking up this issue from its deep slumber?

If the previous work done on this issue is too dusty, then as [~harisekhon] said, there is a
third-party format here:
https://github.com/cotdp/com-cotdp-hadoop/tree/master/src/main/java/com/cotdp/hadoop
With the associated blog post: [http://cutler.io/2012/07/hadoop-processing-zip-files-in-mapreduce/]

We'd all be terribly grateful :)

> want InputFormat for zip files
> ------------------------------
>
>                 Key: MAPREDUCE-210
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-210
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Doug Cutting
>            Assignee: indrajit
>            Priority: Major
>         Attachments: ZipInputFormat_fixed.patch
>
>
> HDFS is inefficient with large numbers of small files.  Thus one might pack many small
> files into large, compressed archives.  But for efficient map-reduce operation, it is desirable
> to be able to split inputs into smaller chunks, with one or more small original files per split.
> The zip format, unlike tar, permits enumeration of files in the archive without scanning
> the entire archive.  Thus a zip InputFormat could efficiently permit splitting large archives
> into splits that contain one or more archived files.
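
The enumeration idea in the description can be illustrated with a minimal sketch. This is not the attached patch and not the third-party format linked above. It assumes a zip file on the local filesystem (java.util.zip.ZipFile does not read from HDFS directly), and the class name ZipSplitSketch, the method planSplits, and the targetBytes parameter are hypothetical names chosen for illustration. ZipFile reads the archive's central directory when it opens the file, so entry names and compressed sizes are available without decompressing any entry data.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical sketch: plan splits from a zip's entry listing alone. */
final class ZipSplitSketch {

    /**
     * Groups entry names into splits of roughly targetBytes of compressed data.
     * Only the entry listing is read; no entry is decompressed.
     */
    static List<List<String>> planSplits(String zipPath, long targetBytes) throws IOException {
        List<List<String>> splits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            List<String> current = new ArrayList<>();
            long currentBytes = 0;
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directory entries carry no file data
                }
                // getCompressedSize() returns -1 when the size is unknown; treat that as 0.
                long size = Math.max(entry.getCompressedSize(), 0L);
                // Start a new split once adding this entry would exceed the target.
                if (!current.isEmpty() && currentBytes + size > targetBytes) {
                    splits.add(current);
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(entry.getName());
                currentBytes += size;
            }
            if (!current.isEmpty()) {
                splits.add(current);
            }
        }
        return splits;
    }
}
```

Each inner list holds the entry names for one split, so every split contains one or more whole archived files. A real InputFormat would additionally have to open the archive through the Hadoop FileSystem API and provide a RecordReader for the entries in each split; neither is attempted here.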



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org

