hadoop-mapreduce-issues mailing list archives

From "Xing Shi (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-1424) prevent Merger fd leak when there are lots empty segments in mem
Date Thu, 28 Jan 2010 16:11:36 GMT
prevent Merger fd  leak when there are lots empty segments in mem
-----------------------------------------------------------------

                 Key: MAPREDUCE-1424
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1424
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: task
            Reporter: Xing Shi


The Merger opens too many files on disk when there are many empty segments in shuffle
memory.


We process large data sets, e.g. > 100 TB, in one job. We use our own partitioner to partition
the map output, so that each map's output is shuffled in its entirety to a single reduce. The
other reduces therefore receive lots of empty segments.

           whole
map_n  |---------->  reduce1
       |   empty
       |---------->  reduce2
       |   empty
       |---------->  reduce3
       |   empty
       |---------->  reduce4

Because our input data is large, there are lots of maps (about 10^5). Typically several
thousand maps send real output to any one reduce, and that reduce also receives several
thousand empty segments.

For example:
     1000 map outputs (on disk) + 3000 empty segments (in mem)

Then, with io.sort.factor=100:

    In the first merge cycle, the merger will merge 10 + 3000 segments [getPassFactor gives
(1000 - 1) % (100 - 1) + 1 = 10 for the on-disk segments, plus the 3000 in-memory segments].
Because there is no real data in the in-memory segments, the remaining 990 on-disk map
outputs are pulled in to replace the 3000 empty ones, so we open 1000 fds.
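
    The arithmetic above can be sketched roughly as follows. This is only a model of the
numbers in this report, not the actual Merger source: the class PassFactorSketch and its
method names are made up for illustration, and the refill behaviour (empty in-memory segments
being replaced by on-disk ones) is assumed from the description above.

```java
// Illustrative sketch only; names are hypothetical and this is not Hadoop's Merger code.
final class PassFactorSketch {

    // Number of on-disk segments to merge in a given pass.
    // The first pass is shortened so that later passes can each merge exactly `factor` segments.
    static int passFactor(int factor, int passNo, int numDiskSegments) {
        if (passNo > 1 || numDiskSegments <= factor || factor == 1) {
            return factor;
        }
        int mod = (numDiskSegments - 1) % (factor - 1);
        return mod == 0 ? factor : mod + 1;
    }

    // Width of the first pass when every in-memory segment is added on top.
    static int firstPassWidth(int factor, int numDiskSegments, int numInMemSegments) {
        return passFactor(factor, 1, numDiskSegments) + numInMemSegments;
    }

    // Assumption: each empty in-memory segment is skipped and its slot is refilled
    // by an on-disk segment, so the number of files opened is capped only by the
    // number of on-disk segments.
    static int filesOpenedInFirstPass(int factor, int numDiskSegments, int numEmptyInMem) {
        return Math.min(numDiskSegments,
                        firstPassWidth(factor, numDiskSegments, numEmptyInMem));
    }

    public static void main(String[] args) {
        System.out.println(passFactor(100, 1, 1000));                 // 10
        System.out.println(firstPassWidth(100, 1000, 3000));          // 3010
        System.out.println(filesOpenedInFirstPass(100, 1000, 3000));  // 1000
    }
}
```

    With no empty in-memory segments, the same sketch gives only 10 open files in the first
pass, which is the behaviour io.sort.factor is meant to bound.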

    Once there are several reduces running on one TaskTracker, we will open several thousand fds.


    I think we can remove the empty segments when the segments are first collected for
merging. Moreover, in the shuffle phase, we could simply not add empty segments to memory
at all.
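
    A minimal sketch of both ideas, under stated assumptions: Segment here is a hypothetical
stand-in (not Hadoop's Merger.Segment), and it assumes an empty segment can be recognised by
a record-data length of zero. How emptiness is actually detected in the real code may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; all names are hypothetical.
final class EmptySegmentFilterSketch {

    // Hypothetical stand-in for a merge segment.
    static final class Segment {
        final String name;
        final long dataLength;  // bytes of real record data (assumed to be known up front)

        Segment(String name, long dataLength) {
            this.name = name;
            this.dataLength = dataLength;
        }

        boolean isEmpty() {
            return dataLength == 0;
        }
    }

    // Idea 1: drop empty segments once, before the merge passes are planned,
    // so they never take up a slot in the first pass.
    static List<Segment> dropEmpty(List<Segment> segments) {
        List<Segment> kept = new ArrayList<>();
        for (Segment s : segments) {
            if (!s.isEmpty()) {
                kept.add(s);
            }
        }
        return kept;
    }

    // Idea 2: during shuffle, do not keep an empty segment in memory at all.
    // Returns true if the segment was added.
    static boolean addIfNotEmpty(List<Segment> inMemory, Segment s) {
        if (s.isEmpty()) {
            return false;
        }
        return inMemory.add(s);
    }
}
```

    Either change alone should keep the first merge pass at the width chosen by
getPassFactor, since no empty in-memory segments are left to inflate it.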

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

