hadoop-common-dev mailing list archives

From "Billy Pearson (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-5539) o.a.h.mapred.Merger not maintaining map out compression on intermediate files
Date Fri, 20 Mar 2009 06:52:50 GMT
o.a.h.mapred.Merger not maintaining map out compression on intermediate files

                 Key: HADOOP-5539
                 URL: https://issues.apache.org/jira/browse/HADOOP-5539
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.19.1
         Environment: 0.19.2-dev, r753365 
            Reporter: Billy Pearson
             Fix For: 0.19.2, 0.20.0

hadoop-site.xml :
mapred.compress.map.output = true

Map output files are compressed, but when the in-memory merger closes on the
reduce side, the on-disk merger runs (if needed) to reduce the number of input files to <= io.sort.factor.
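
To give a feel for how many intermediate files such a reduction can produce, here is a minimal sketch in plain Java. It is not the Merger code: it assumes every pass merges exactly `factor` segments into one intermediate file, whereas the real Merger may size its passes differently. The class and method names are hypothetical, and the values 3000 and 30 are borrowed from the log in this report purely for illustration (treating 30 as the merge factor is an assumption).

```java
public class MergePassSketch {

    /**
     * Counts the intermediate merge passes needed until at most `factor`
     * segments remain. Simplifying assumption: each pass merges exactly
     * `factor` segments into a single intermediate file.
     */
    public static int intermediatePasses(int segments, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be >= 2");
        }
        int passes = 0;
        while (segments > factor) {
            // `factor` segments are consumed and one intermediate file is produced.
            segments = segments - factor + 1;
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        // Illustrative values only: 3000 segments, merge factor assumed to be 30.
        System.out.println(intermediatePasses(3000, 30)); // prints 103
    }
}
```

Each of those passes writes one intermediate file, so any loss of compression at this stage is multiplied across all of them.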

When this happens it writes files named intermediate.x. These files do not
maintain the compression setting: the writer (o.a.h.mapred.Merger.class line 432)
is passed the codec, but I added some logging and the codec is always null, whether map output
compression is set to true or false.

This causes tasks to fail if they cannot hold the uncompressed size of the reduce data
they are holding.
I think this is just an oversight: the codec is not getting set correctly for the on-disk merge.
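
The effect of a writer receiving a null codec can be sketched in plain Java. This is only an illustration under stated assumptions: `Codec`, `GZIP` and `writeSegment` are hypothetical stand-ins (using `java.util.zip`), not Hadoop's CompressionCodec or the writer used by the Merger, and the records are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class NullCodecSketch {

    /** Hypothetical stand-in for a compression codec (not Hadoop's interface). */
    public interface Codec {
        OutputStream wrap(OutputStream out) throws IOException;
    }

    /** Stand-in codec backed by gzip from the JDK. */
    public static final Codec GZIP = out -> new GZIPOutputStream(out);

    /**
     * Writes the records to an in-memory sink and returns the number of bytes
     * that ended up in it. A null codec means the data is written uncompressed.
     */
    public static int writeSegment(List<String> records, Codec codec) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = (codec == null) ? sink : codec.wrap(sink)) {
            for (String record : records) {
                out.write(record.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    public static void main(String[] args) {
        // Made-up, highly repetitive records standing in for map output.
        List<String> records = Collections.nCopies(1000, "key\tvalue\n");
        System.out.println("null codec: " + writeSegment(records, null) + " bytes");
        System.out.println("gzip codec: " + writeSegment(records, GZIP) + " bytes");
    }
}
```

With a null codec the writer silently falls back to raw bytes, so the caller sees no error, only larger files on disk.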

2009-03-20 01:30:30,005 INFO org.apache.hadoop.mapred.Merger: Merging 30 intermediate segments out of a total of 3000
2009-03-20 01:30:30,005 INFO org.apache.hadoop.mapred.Merger: intermediate.1 used codec: null

I added
           // added by me
           if (codec != null) {
             LOG.info("intermediate." + passNo + " used codec: " + codec.toString());
           } else {
             LOG.info("intermediate." + passNo + " used codec: null");
           }
           // end added by me
just before the creation of the writer (o.a.h.mapred.Merger.class line 432),
and it outputs the second log line above.

I have confirmed this with the logging, and I have also looked at the files on the tasktracker's disk.
I can read the data in the intermediate files clearly, which tells me they are not compressed,
but I cannot read the map.out files taken directly from the map output.
This tells me compression is working on the map end but not in the on-disk merge that produces
the intermediate files.

I can see no benefit in these files not maintaining the compression setting, and it looks as though
they were intended to maintain it.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
