hadoop-common-user mailing list archives

From "Stuart Sierra" <m...@stuartsierra.com>
Subject Re: compressed/encrypted file
Date Thu, 05 Jun 2008 22:03:10 GMT
On Wed, Jun 4, 2008 at 6:52 PM, Arun C Murthy <arunc@yahoo-inc.com> wrote:
> With the current compression codecs available in Hadoop (zlib/gzip/lzo) it
> is not possible to split up a compressed file and then process it in a
> parallel manner. However once we get bzip2 to work we could split up the
> files as you are describing...

If it helps, on *nix (with GNU split) you can split a compressed text file like this:
    gunzip -c original.txt.gz | split -a 5 -d -C 16777216 - output.txt.

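Replace 16777216 (16 MiB) with the maximum number of bytes you want per
split.  With -C, split breaks only at line boundaries, as long as no
single line is longer than that limit (a longer line is itself broken
across files).  You get files named output.txt.00000, output.txt.00001,
and so on.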
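
For repeated use, the same pipeline can be wrapped in a small shell function. This is only a sketch: the name split_gz and its three arguments are made up for illustration, and it assumes GNU coreutils split, since -d and -C are not available in every split implementation.

```shell
# Hypothetical helper (not part of Hadoop or coreutils): decompress a gzipped
# text file and split it into line-aligned pieces of at most max_bytes each.
# Assumes GNU split for the -d (numeric suffixes) and -C (line-bytes) options.
split_gz() {
    src=$1        # path to the .gz input, e.g. original.txt.gz
    max_bytes=$2  # upper bound on bytes per piece, e.g. 16777216
    prefix=$3     # output prefix, e.g. "output.txt." (trailing dot included)
    gunzip -c "$src" | split -a 5 -d -C "$max_bytes" - "$prefix"
}

# Example call, equivalent to the one-liner above:
#   split_gz original.txt.gz 16777216 output.txt.
#
# Sanity check that the pieces concatenate back to the original text:
#   gunzip -c original.txt.gz > original.txt
#   cat output.txt.* | cmp - original.txt
```

The zero-padded numeric suffixes mean that a plain shell glob such as output.txt.* lists the pieces in their original order, which is what the cmp check relies on.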

-Stuart
