hadoop-user mailing list archives

From Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
Subject Re: Spill file compression
Date Wed, 07 Nov 2012 13:18:04 GMT
OK, I found the answer to one of my questions just now -- the location of
the spill files and their sizes. So, there's a discrepancy between what I
see and what you said about the compression. The total size of all spill
files of a single task matches my estimate of their size
*without* compression. It seems they aren't compressed, but that's strange
because I definitely enabled compression the way I described.
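
For anyone who wants to repeat the check described above, here is a minimal sketch of how the on-disk spill sizes could be totalled and compared with an uncompressed estimate. The `spill*.out` file name pattern, the directory argument, and the 10% tolerance are assumptions for illustration; adjust them to whatever the task's local directory actually contains.

```python
from pathlib import Path


def summarize_spills(task_dir, pattern="spill*.out"):
    """Return (count, total_bytes, average_bytes) for spill files under task_dir.

    The default pattern is an assumption about how spill files are named;
    it deliberately does not match companion files such as "*.out.index".
    """
    sizes = [p.stat().st_size for p in Path(task_dir).rglob(pattern) if p.is_file()]
    total = sum(sizes)
    average = total / len(sizes) if sizes else 0.0
    return len(sizes), total, average


def looks_uncompressed(on_disk_bytes, estimated_raw_bytes, tolerance=0.1):
    """True if the on-disk size is within `tolerance` of the uncompressed estimate."""
    if estimated_raw_bytes <= 0:
        raise ValueError("estimated_raw_bytes must be positive")
    return abs(on_disk_bytes - estimated_raw_bytes) / estimated_raw_bytes <= tolerance
```

With the figures from this thread (about 200 spills totalling about 2GB), the average works out to roughly 10MB per spill. If `looks_uncompressed` returns True for the measured total, that is consistent with the spills not being compressed, but it does not prove it.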

2012/11/7 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>

> OK, just wanted to confirm. Maybe there is another problem then. I just
> looked at the task logs and there were ~200 spills recorded for a single
> task, and only afterwards was there a merge phase. In my case, 200 spills are
> about 2GB (uncompressed). One map output record easily fits into the
> in-memory buffer, in fact, a few records fit into it. But Hadoop decides to
> write gigabytes of spill to disk and it seems that the disk I/O and merging
> make everything really slow. There doesn't seem to be a
> max.num.spills.for.combine though. Is there any typical advice for this
> kind of situation? Also, is there a way to see the size of the compressed
> spill files to get a better idea about the file sizes I'm dealing with?
> 2012/11/7 Harsh J <harsh@cloudera.com>
>> Yes we do compress each spill output using the same codec as specified
>> for map (intermediate) output compression. However, the counted bytes
>> may be counting decompressed values of the records written, and not
>> post-compressed ones.
>> On Wed, Nov 7, 2012 at 6:02 PM, Sigurd Spieckermann
>> <sigurd.spieckermann@gmail.com> wrote:
>> > Hi guys,
>> >
>> > I've encountered a situation where the ratio between "Map output bytes" and
>> > "Map output materialized bytes" is quite huge and during the map-phase data
>> > is spilled to disk quite a lot. This is something I'll try to optimize, but
>> > I'm wondering if the spill files are compressed at all. I set
>> > mapred.compress.map.output=true and
>> > mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
>> > and everything else seems to be working correctly. Does Hadoop actually
>> > compress spills or just the final spill after finishing the entire map-task?
>> >
>> > Thanks,
>> > Sigurd
>> --
>> Harsh J
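
The counter comparison from the original question can be written down as a small calculation. This is only a sketch: it assumes "Map output bytes" counts the serialized records before compression and "Map output materialized bytes" counts what was actually written for the final map output, which is the usual reading of those counters but is not confirmed in this thread. The function name and the sample numbers are made up for illustration.

```python
def counter_ratio(map_output_bytes, map_output_materialized_bytes):
    """Ratio of the two job counters; values well above 1 suggest compression took effect.

    Assumes the first counter is pre-compression and the second is the
    on-disk size of the final map output.
    """
    if map_output_materialized_bytes <= 0:
        raise ValueError("materialized bytes must be positive")
    return map_output_bytes / map_output_materialized_bytes


# Hypothetical counter values, not taken from any real job.
ratio = counter_ratio(2_000_000_000, 500_000_000)
print(f"output/materialized ratio: {ratio:.1f}")
```

Note that this ratio only describes the final map output. It says nothing by itself about whether the intermediate spill files were compressed, which is the open question above.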
