hadoop-common-dev mailing list archives

From Arun C Murthy <...@yahoo-inc.com>
Subject Re: InputFormat for tarball
Date Tue, 19 Feb 2008 21:20:43 GMT

On Feb 19, 2008, at 12:31 PM, Doug Cutting wrote:

> Goel, Ankur wrote:
>> Hi All,
>> Is there an input format available for reading from tarballs
>> (.tar.gz files)?
> Not at present.  There is support for reading .gz files, but  
> not .tar files.  A problem is that there's no way to read a  
> chunk of such archives without reading everything preceding that  
> chunk.  So, if such an InputFormat were written, it would be unable  
> to efficiently split the processing of an archive among map tasks.

Would it make sense to write a simple tool (maybe a Map-Reduce
application) which, given a tarball, uncompresses and unpacks it and
writes its contents out as separate files? Folks can then run
Map-Reduce applications on top of the extracted files...
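
To make the unpack step concrete, here is a rough sketch in Python. It is
a local, single-process illustration only: not a Map-Reduce job, and not
anything Hadoop ships. The function name unpack_tarball and the paths in
the usage note are made up for this example.

```python
import os
import shutil
import tarfile


def unpack_tarball(archive_path, out_dir):
    """Write every regular file in a .tar.gz archive to out_dir.

    Returns the list of paths written, in archive order.
    unpack_tarball is a hypothetical helper, not a Hadoop API.
    """
    written = []
    out_root = os.path.realpath(out_dir)
    # "r:gz" reads the gzip stream from the beginning; a later entry can
    # only be reached by decompressing everything that precedes it.
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and device entries
            target = os.path.realpath(os.path.join(out_root, member.name))
            # Refuse entries that would land outside out_dir ("../x", "/x").
            if os.path.commonpath([out_root, target]) != out_root:
                raise ValueError("unsafe path in archive: " + member.name)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)  # stream the entry in chunks
            written.append(target)
    return written
```

Something like unpack_tarball("logs.tar.gz", "logs/") would leave one
file per archive entry under logs/, which could then be copied into the
DFS and processed as ordinary input files.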

Blimey! This should be supported by distcp! *smile*


> Doug
