hadoop-common-dev mailing list archives

From "Goel, Ankur" <Ankur.G...@corp.aol.com>
Subject HADOOP-1824 | Proposed implementation
Date Wed, 26 Dec 2007 12:06:11 GMT
   I am working on an InputFormat for zip files, as required by
HADOOP-1824. I would like to propose a simple approach and invite
comments and suggestions from the community on my implementation.

Implementation Approach

1. Implement class ZipInputFormat to extend FileInputFormat.

2. Override the getSplits() method to read each file's
   InputStream and construct a ZipInputStream out of it.

3. Create FileSplits such that each split has the following:
      *  FileSplit.start = index of the first zip entry in the split.
      *  FileSplit.length = index of the last zip entry in the split.
      *  FileSplit.file = the zip file.
      *  Sum of the compressed sizes of its zip entries <= splitSize.

   For example, start = 3, length = 6 signifies that zip entries 3 to 6
   will be read from the zip file of this split.

4. Implement class ZipRecordReader to read each zip entry in its split
   using LineRecordReader.
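
To make steps 2 to 4 concrete, here is a minimal sketch that uses only
java.util.zip and no Hadoop classes. The class ZipSplitSketch and its method
names are hypothetical. It assumes entry indices are 0-based and inclusive,
and that an entry larger than splitSize gets a split of its own. A plain
BufferedReader stands in for LineRecordReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of Hadoop.
final class ZipSplitSketch {

    /** Step 2: scan the zip stream once and collect each entry's compressed size. */
    static List<Long> compressedSizes(InputStream in) throws IOException {
        List<Long> sizes = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // The size may be reported as -1 until the entry has been consumed.
                zin.closeEntry();
                sizes.add(Math.max(entry.getCompressedSize(), 0L));
            }
        }
        return sizes;
    }

    /** Step 3: group consecutive entries into {first, last} index ranges. */
    static List<int[]> planSplits(List<Long> sizes, long splitSize) {
        List<int[]> ranges = new ArrayList<>();
        int first = 0;
        long total = 0;
        for (int i = 0; i < sizes.size(); i++) {
            long size = sizes.get(i);
            // Close the current range if adding this entry would exceed splitSize.
            if (i > first && total + size > splitSize) {
                ranges.add(new int[] {first, i - 1});
                first = i;
                total = 0;
            }
            total += size;
        }
        if (first < sizes.size()) {
            ranges.add(new int[] {first, sizes.size() - 1});
        }
        return ranges;
    }

    /** Step 4: read every line of the entries with index first..last. */
    static List<String> readLines(InputStream in, int first, int last) throws IOException {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            int index = 0;
            while (index <= last && (entry = zin.getNextEntry()) != null) {
                if (index >= first && !entry.isDirectory()) {
                    // The reader hits end-of-stream at the entry boundary.
                    // It is not closed here, since that would close zin too.
                    BufferedReader reader = new BufferedReader(
                            new InputStreamReader(zin, StandardCharsets.UTF_8));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        lines.add(line);
                    }
                }
                index++;
            }
        }
        return lines;
    }
}
```

Note that a reader for a later split still has to walk past all earlier
entries in the stream, because ZipInputStream cannot seek to an entry by
index.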

I think I might need to deal with CompressionCodecFactory and the
classes related to compression. How exactly is not very clear to me,
so any hints here would be useful.

Apart from the above please let me know if there is anything that I am 

