hadoop-mapreduce-user mailing list archives

From Pedro Costa <psdc1...@gmail.com>
Subject Re: compressed map intermediate files
Date Tue, 15 Feb 2011 10:49:08 GMT
As I understand from the log output that I posted, since the example
has 3 Reduces, all spill 0 files will be merged and go to Reduce 0,
all spill 1 files will be merged and go to Reduce 1, and all spill 2
files will be merged and go to Reduce 2.

Does this mean that, if we turn compression on, it is the merged files
that are compressed?
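
To make the layout I am describing concrete, here is a toy model in Python. It is only a sketch of my understanding, not Hadoop's code: the function name `merge_spills` is hypothetical, plain concatenation stands in for the real sorted merge, and zlib stands in for whichever codec is configured. It assumes that compression is applied to each merged per-reduce segment separately, which is exactly the point I am asking about.

```python
import zlib

def merge_spills(spills, num_reduces, compress=False):
    """Merge per-partition segments from several spills into one output.

    spills: list of spills; each spill is a list with one bytes object
            per partition (index p holds the data bound for Reduce p).
    Returns (merged_bytes, index), where index has one
    (segment_start, part_length, raw_length) tuple per partition.
    """
    segments = []
    index = []
    start = 0
    for p in range(num_reduces):
        # Gather partition p from every spill (the real merge would sort-merge).
        raw = b"".join(spill[p] for spill in spills)
        # Assumption: each merged segment is compressed on its own.
        data = zlib.compress(raw) if compress else raw
        index.append((start, len(data), len(raw)))
        segments.append(data)
        start += len(data)
    return b"".join(segments), index

if __name__ == "__main__":
    spills = [
        [b"a" * 100, b"b" * 50, b"c" * 10],
        [b"a" * 20, b"b" * 5, b"c" * 1],
    ]
    for flag in (False, True):
        _, idx = merge_spills(spills, 3, compress=flag)
        for p, (seg_start, part_len, raw_len) in enumerate(idx):
            print(flag, p, seg_start, part_len, raw_len)
```

In this model each reduce can read its own byte range from the merged output and decompress it without touching the other segments.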

Thanks,





On Tue, Feb 15, 2011 at 10:35 AM, Pedro Costa <psdc1978@gmail.com> wrote:
> Hi,
>
> I ran two examples of an MR execution with the same input files and
> with 3 Reduce tasks defined. One example has the map-intermediate
> files compressed, and the other has uncompressed data. Below is the
> output of some debug lines that I added to the code.
>
> 1 - With the uncompressed data, the raw length is always smaller than
> the partition length, but with the compressed data it is not. Why is
> the raw length bigger than the partition length in the compressed data?
>
> 2 - If we define the map-intermediate files as compressed, how are the
> map-intermediate files distributed to all the reduces? Since we can
> split a compressed file, does this mean that each spill file is
> compressed? For example, Compressed(Spill idx 0) goes to Reduce 0,
> Compressed(Spill idx 1) goes to Reduce 1 and Compressed(Spill idx 2)
> goes to Reduce 2.
>
> Compressed data
>
> Spill idx 0 - SegmentStart: 0 Part length: 10560 Raw length: 27567
> Spill idx 1 - SegmentStart: 10560 Part length: 10029 Raw length: 26003
> Spill idx 2 - SegmentStart: 20589 Part length: 10142 Raw length: 26459
>
> Spill idx 0 - SegmentStart: 0 Part length: 10202 Raw length: 26785
> Spill idx 1 - SegmentStart: 10202 Part length: 9932 Raw length: 26100
> Spill idx 2 - SegmentStart: 20134 Part length: 9926 Raw length: 25821
>
> Spill idx 0 - SegmentStart: 0 Part length: 9410 Raw length: 24503
> Spill idx 1 - SegmentStart: 9410 Part length: 9849 Raw length: 25564
> Spill idx 2 - SegmentStart: 19259 Part length: 9489 Raw length: 24716
>
> Spill idx 0 - SegmentStart: 0 Part length: 1661 Raw length: 3440
> Spill idx 1 - SegmentStart: 1661 Part length: 1527 Raw length: 3160
> Spill idx 2 - SegmentStart: 3188 Part length: 1737 Raw length: 3750
>
>
>
> Non-compressed data
>
> Spill idx 0 - SegmentStart: 0 Part length: 27571 Raw length: 27567
> Spill idx 1 - SegmentStart: 27571 Part length: 26007 Raw length: 26003
> Spill idx 2 - SegmentStart: 53578 Part length: 26463 Raw length: 26459
>
> Spill idx 0 - SegmentStart: 0 Part length: 26789 Raw length: 26785
> Spill idx 1 - SegmentStart: 26789 Part length: 26104 Raw length: 26100
> Spill idx 2 - SegmentStart: 52893 Part length: 25825 Raw length: 25821
>
> Spill idx 0 - SegmentStart: 0 Part length: 24507 Raw length: 24503
> Spill idx 1 - SegmentStart: 24507 Part length: 25568 Raw length: 25564
> Spill idx 2 - SegmentStart: 50075 Part length: 24720 Raw length: 24716
>
> Spill idx 0 - SegmentStart: 0 Part length: 3444 Raw length: 3440
> Spill idx 1 - SegmentStart: 3444 Part length: 3164 Raw length: 3160
> Spill idx 2 - SegmentStart: 6608 Part length: 3754 Raw length: 3750
>
>
> Thanks,
>
> --
> Pedro
>



-- 
Pedro
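
The arithmetic in the logged values can be checked with a short Python script. It uses the first block of each run from the output above and confirms two things: each SegmentStart equals the sum of the preceding part lengths, and in the uncompressed run the part length exceeds the raw length by a constant 4 bytes. What those 4 bytes hold is not stated in this thread; a fixed per-segment trailer such as a checksum is one possible reading, and that is an assumption. The names `COMPRESSED`, `UNCOMPRESSED` and `overheads` are only illustrative.

```python
# (segment_start, part_length, raw_length), copied from the first block of each run.
COMPRESSED = [
    (0, 10560, 27567),
    (10560, 10029, 26003),
    (20589, 10142, 26459),
]
UNCOMPRESSED = [
    (0, 27571, 27567),
    (27571, 26007, 26003),
    (53578, 26463, 26459),
]

def overheads(rows):
    """Return part_length - raw_length per segment.

    Raises ValueError if a segment does not start where the previous one ends.
    """
    offset = 0
    result = []
    for start, part_len, raw_len in rows:
        if start != offset:
            raise ValueError(f"segment starts at {start}, expected {offset}")
        offset += part_len
        result.append(part_len - raw_len)
    return result

if __name__ == "__main__":
    print("uncompressed overhead:", overheads(UNCOMPRESSED))
    print("compressed overhead:  ", overheads(COMPRESSED))
    for (_, part_len, raw_len) in COMPRESSED:
        print("compressed/raw ratio: %.3f" % (part_len / raw_len))
```

The raw lengths are the same in both runs, so the raw length appears to be the size before compression and the part length the number of bytes actually written for that segment.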
