hadoop-hdfs-user mailing list archives

From Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
Subject Re: Spill file compression
Date Wed, 07 Nov 2012 13:12:59 GMT
OK, just wanted to confirm. Maybe there is another problem then. I just
looked at the task logs: ~200 spills were recorded for a single task, and
only afterwards was there a merge phase. In my case, the 200 spills amount
to about 2 GB (uncompressed). One map output record easily fits into the
in-memory buffer; in fact, a few records fit into it. Yet Hadoop writes
gigabytes of spill to disk, and the disk I/O and merging seem to make
everything really slow. There doesn't seem to be a
max.num.spills.for.combine property, though. Is there any typical advice
for this kind of situation? Also, is there a way to see the size of the
compressed spill files, to get a better idea of the file sizes I'm dealing with?
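As a rough sanity check, the expected spill count can be estimated from the sort buffer size. A minimal back-of-envelope sketch, using the 2 GB / 200-spill figures from the logs above and *assuming* the Hadoop 1.x defaults io.sort.mb=100 and io.sort.spill.percent=0.80 (the actual cluster settings may differ):

```python
import math

# Figures from the task logs above
output_mb = 2048        # ~2 GB of uncompressed map output
observed_spills = 200

# Assumed defaults -- check mapred-site.xml / the job conf
io_sort_mb = 100        # io.sort.mb
spill_percent = 0.80    # io.sort.spill.percent

# Each spill flushes roughly io.sort.mb * io.sort.spill.percent of data
expected_spills = math.ceil(output_mb / (io_sort_mb * spill_percent))
print(expected_spills)  # ~26 spills with a default 100 MB buffer

# Working backwards: the effective buffer size implied by 200 spills
effective_buffer_mb = output_mb / observed_spills / spill_percent
print(round(effective_buffer_mb, 1))  # ~12.8 MB, far below 100 MB
```

If the arithmetic comes out like this, the effective buffer is much smaller than expected, and raising io.sort.mb (and possibly io.sort.record.percent, since many small records exhaust the record-accounting portion of the buffer first) would be the usual first lever to try.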

2012/11/7 Harsh J <harsh@cloudera.com>

> Yes, we do compress each spill output using the same codec as specified
> for map (intermediate) output compression. However, the counted bytes
> may reflect the decompressed size of the records written, not the
> post-compression size.
> On Wed, Nov 7, 2012 at 6:02 PM, Sigurd Spieckermann
> <sigurd.spieckermann@gmail.com> wrote:
> > Hi guys,
> >
> > I've encountered a situation where the ratio between "Map output bytes"
> and
> > "Map output materialized bytes" is quite huge and during the map-phase
> data
> > is spilled to disk quite a lot. This is something I'll try to optimize,
> but
> > I'm wondering if the spill files are compressed at all. I set
> > mapred.compress.map.output=true and
> >
> mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
> > and everything else seems to be working correctly. Does Hadoop actually
> > compress spills or just the final spill after finishing the entire
> map-task?
> >
> > Thanks,
> > Sigurd
> --
> Harsh J
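For reference, a sketch of the relevant job-level settings in the Hadoop 1.x property names used in this thread. The values are illustrative assumptions, not recommendations for every workload; note also that the property that actually exists is min.num.spills.for.combine (the minimum number of spills before the combiner runs during merge), not a "max" variant:

```xml
<!-- Illustrative mapred-site.xml / per-job settings (Hadoop 1.x names) -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>  <!-- compress each spill as well as the final map output -->
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>io.sort.mb</name>
  <value>200</value>   <!-- larger sort buffer -> fewer spills (map task heap must accommodate it) -->
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.80</value>  <!-- fraction of the buffer filled before a spill starts -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>50</value>    <!-- number of spill files merged per merge pass -->
</property>
<property>
  <name>min.num.spills.for.combine</name>
  <value>3</value>     <!-- run the combiner during merge once at least this many spills exist -->
</property>
```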
