hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Compression using Hadoop...
Date Fri, 31 Aug 2007 17:43:09 GMT
Arun C Murthy wrote:
> One way to reap benefits of both compression and better parallelism is to use compressed
> SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile
>
> Of course this means you will have to do a conversion from .gzip to .seq file and load
> it onto hdfs for your job, which should be fairly simple piece of code.

We really need someone to contribute an InputFormat for bzip files. 
This has come up before: bzip is a standard compression format that is 
splittable.

Another InputFormat that would be handy is zip.  Zip archives, unlike 
tar files, can be split by reading the table of contents.  So one could 
package a bunch of tiny files as a zip file, then the input format could 
split the zip file into splits that each contain a number of files 
inside the zip.  Each map task would then have to read the table of 
contents from the file, but could then seek directly to the files in its 
split without scanning the entire file.
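
To make the zip idea concrete, here is a rough sketch of the splitting 
logic in plain Java.  It is not a Hadoop InputFormat: it uses only 
java.util.zip against a local file, and the class and method names 
(ZipSplitSketch, listEntries, planSplits, readSplit, writeDemoZip) are 
made up for illustration.  A real implementation would presumably need 
to read the table of contents through HDFS's seekable streams instead 
of java.util.zip.ZipFile, which expects a local file.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: plan splits from a zip's table of contents.
class ZipSplitSketch {

    /** Reads the table of contents and returns the names of all file entries. */
    static List<String> listEntries(File zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip)) {
            for (ZipEntry e : Collections.list(zf.entries())) {
                if (!e.isDirectory()) {
                    names.add(e.getName());
                }
            }
        }
        return names;
    }

    /** Groups entry names into splits holding at most perSplit entries each. */
    static List<List<String>> planSplits(List<String> names, int perSplit) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < names.size(); i += perSplit) {
            int end = Math.min(i + perSplit, names.size());
            splits.add(new ArrayList<>(names.subList(i, end)));
        }
        return splits;
    }

    /**
     * Roughly what one map task would do: open the archive (which reads the
     * table of contents), then read only the entries assigned to its split.
     * Returns the number of uncompressed bytes read.
     */
    static long readSplit(File zip, List<String> split) throws IOException {
        long bytes = 0;
        byte[] buf = new byte[8192];
        try (ZipFile zf = new ZipFile(zip)) {
            for (String name : split) {
                try (InputStream in = zf.getInputStream(zf.getEntry(name))) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        bytes += n;
                    }
                }
            }
        }
        return bytes;
    }

    /** Writes a small demo archive containing count tiny text files. */
    static void writeDemoZip(File zip, int count) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip))) {
            for (int i = 0; i < count; i++) {
                out.putNextEntry(new ZipEntry("file-" + i + ".txt"));
                out.write(("contents of file " + i + "\n").getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File zip = File.createTempFile("tiny-files", ".zip");
        zip.deleteOnExit();
        writeDemoZip(zip, 5);

        List<List<String>> splits = planSplits(listEntries(zip), 2);
        for (List<String> split : splits) {
            System.out.println(split + " -> " + readSplit(zip, split) + " bytes");
        }
    }
}
```

Grouping by entry count is the simplest policy; grouping by compressed 
size (ZipEntry.getCompressedSize()) would probably balance map tasks 
better when the packaged files vary a lot in size.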

Should we file JIRA issues for these?  Any volunteers interested 
in implementing them?

Doug
