hadoop-common-user mailing list archives

From "Alex Feinberg" <a...@socialmedia.com>
Subject Re: Has anyone experienced a major lag reading gzipped files?
Date Tue, 19 Aug 2008 22:37:11 GMT

When using a gzipped file, there is a hard rule of a single map task per file, because a gzip stream cannot be split at arbitrary offsets. A plain text file, by contrast, is split into however many map tasks Hadoop finds optimal.

If you want to keep using compressed files, it may help to divide them into optimally sized chunks (roughly the splits Hadoop itself would make on the uncompressed data) and then compress each chunk separately.


>  Has anyone been having issues with Hadoop jobs involving a large
> collection of gzipped files, specifically on EC2?  I currently have a job
> set up in which I take as input about 360 gzipped log files within HDFS,
> totalling about 30GB of compressed data.  I have noticed that if I leave
> these files compressed, the JobClient hangs at 0% map, 0% reduce, and will
> eventually jump to 100% map and reduce (after a rather long time) without
> having reported any progress in between.
> However, if I unzip the files before pushing them to HDFS, the job starts
> almost immediately.  I currently have a Cascading script that unzips the log
> files from a local directory while pushing them into HDFS.  This approach,
> however, is rather brute force and takes an incredibly long time.

Alex Feinberg
Platform Engineer, SocialMedia Networks
