hadoop-common-user mailing list archives

From "Stuart Sierra" <m...@stuartsierra.com>
Subject Re: compressed/encrypted file
Date Thu, 05 Jun 2008 22:03:10 GMT
On Wed, Jun 4, 2008 at 6:52 PM, Arun C Murthy <arunc@yahoo-inc.com> wrote:
> With the current compression codecs available in Hadoop (zlib/gzip/lzo) it
> is not possible to split up a compressed file and then process it in a
> parallel manner. However once we get bzip2 to work we could split up the
> files as you are describing...

If it helps, on *nix (with GNU split) you can decompress a gzipped text
file and split the result like this:
    gunzip -c original.txt.gz | split -a 5 -d -C 16777216 - output.txt.

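Replace 16777216 (16MB) with the maximum number of bytes you want per
piece.  With -C, split breaks only at line boundaries, unless a single
line is longer than the limit, in which case that line is itself split.
The -a 5 -d options select five-digit numeric suffixes, so you get
files named output.txt.00000, output.txt.00001, and so on.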
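
If you want the pieces compressed again afterwards, something along
these lines might work.  It is only a sketch: the function name
split_gz and its three arguments are made up for illustration, it
assumes GNU split (for -d and -C), and it recompresses with gzip
purely as an example.

```shell
#!/bin/sh
# Sketch only: split_gz and its arguments are hypothetical names.
# Requires GNU split for the -d and -C options.
split_gz() {
    input=$1        # gzipped text file, e.g. original.txt.gz
    prefix=$2       # output prefix, e.g. output.txt.
    max_bytes=$3    # maximum bytes per piece, e.g. 16777216

    # Decompress to stdout and split at line boundaries.
    gunzip -c "$input" | split -a 5 -d -C "$max_bytes" - "$prefix" || return 1

    # Recompress each piece on its own (five-digit numeric suffixes).
    for piece in "$prefix"[0-9][0-9][0-9][0-9][0-9]; do
        if [ -f "$piece" ]; then
            gzip "$piece" || return 1
        fi
    done
}
```

For example, "split_gz original.txt.gz output.txt. 16777216" should
leave output.txt.00000.gz, output.txt.00001.gz, and so on in the
current directory.  Each piece is then an independent gzip file.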

