hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: reading zip files
Date Thu, 11 May 2006 18:08:41 GMT
Vijay Murthi wrote:
> I am trying to process several gigs of zipped text files from a directory. If I unzip
> them, the size increases at least 4 times and I could potentially run out of disk space.
> Has anyone tried to read zipped text files directly from the input directory?
> Or has anyone tried implementing a zip version of SequenceFileRecordReader.java and FileSplit?

SequenceFile currently supports per-record compression.  This is 
effective when your input records are fairly large (> a few kB).
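
To illustrate why record size matters for per-record compression, here is a minimal sketch that compresses the same records one at a time and then as a single block. It uses plain java.util.zip.Deflater rather than SequenceFile itself, and the class name, helper names and sample records are all made up for the illustration, so treat it as a rough demonstration and not a measurement of Hadoop's behavior.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

public class CompressionGranularity {

    // Number of bytes DEFLATE emits for the given input.
    public static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // Total size when every record is compressed on its own.
    public static int perRecordSize(List<String> records) {
        int total = 0;
        for (String record : records) {
            total += compressedSize(record.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    // Size when all records are concatenated and compressed once.
    public static int batchedSize(List<String> records) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (String record : records) {
            byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
            all.write(bytes, 0, bytes.length);
        }
        return compressedSize(all.toByteArray());
    }

    // Hypothetical small log-like records, made up for this demo.
    public static List<String> sampleRecords(int count) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add("id=" + i + " status=ok path=/index.html\n");
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = sampleRecords(1000);
        System.out.println("per-record bytes: " + perRecordSize(records));
        System.out.println("batched bytes:    " + batchedSize(records));
    }
}
```

With short records like these, each individually compressed record carries its own stream overhead and shares no redundancy with its neighbors, so the per-record total should come out well above the batched size.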

What format are your zipped input files in?  Are there multiple records 
per file?  If so, how big are the records?  A future goal for 
SequenceFile is to support compression across multiple records, to make 
compression effective with small records.  Until then, compression of 
small records is difficult.  The best approach currently is to use an 
InputFormat that does not split files, but makes each file into a 
distinct split.  Then try to divide your data into approximately 
equal-sized files, each compressed separately.
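
As a rough sketch of what reading one unsplit, compressed file could look like, the code below streams the lines of a ZIP archive straight from an InputStream without unpacking it to disk. It assumes the inputs are ZIP archives of UTF-8 text (for gzip files, java.util.zip.GZIPInputStream would be the analogue), and the class ZipLineReader and its helpers are hypothetical names for this example. The Hadoop InputFormat and RecordReader wiring is deliberately left out.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipLineReader {

    // Streams every line of every entry to the callback; nothing is unzipped to disk.
    public static void forEachLine(InputStream in, Consumer<String> onLine) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // New reader per entry; it is not closed here, because closing it
                // would also close the underlying archive stream.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    onLine.accept(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a tiny in-memory archive so the demo needs no files (made-up data).
    public static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("first\nsecond\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("b.txt"));
            zip.write("third\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        forEachLine(new ByteArrayInputStream(sampleZip()), System.out::println);
    }
}
```

Because a compressed stream generally has to be read from its beginning, one reader per file is the natural unit of work here, which is why the files themselves should be roughly equal in size.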

