hadoop-common-dev mailing list archives

From "eric baldeschwieler (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-331) map outputs should be written to a single output file with an index
Date Wed, 28 Jun 2006 20:51:29 GMT
map outputs should be written to a single output file with an index
-------------------------------------------------------------------

         Key: HADOOP-331
         URL: http://issues.apache.org/jira/browse/HADOOP-331
     Project: Hadoop
        Type: Improvement

  Components: mapred  
    Versions: 0.3.2    
    Reporter: eric baldeschwieler
 Assigned to: Yoram Arnon 
     Fix For: 0.5.0


The current strategy of writing one file per target reduce for each map wastes a lot of
buffer space, since most of each buffer sits unused (causing out-of-memory crashes), and puts
a heavy burden on the FS (many opens, inodes used, etc.).

I propose that we write a single file containing all output, plus an index file identifying
which byte range in the file goes to each reduce.  This will remove the buffer waste, address
scaling issues with the number of open files, and generally set us up better for scaling.
It will also help with very small inputs, since the buffer cache will reduce the number of
seeks needed, and the node serving the data can open a single file and keep it open rather
than doing directory and open operations on every request.
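
To make the proposed layout concrete, here is a minimal sketch in Python (not Hadoop code). It assumes each reduce's output is already available as a byte string, and it uses a made-up index format of one fixed-size (offset, length) pair per reduce; the names write_map_output and read_partition are hypothetical.

```python
import io
import struct

# Assumed index entry format: big-endian (offset, length), 8 bytes each.
INDEX_ENTRY = struct.Struct(">qq")


def write_map_output(partitions, data_file, index_file):
    """Write every reduce's bytes back to back into one data file and
    record the byte range of each reduce in the index file."""
    offset = 0
    for chunk in partitions:  # partitions[i] holds the bytes for reduce i
        data_file.write(chunk)
        index_file.write(INDEX_ENTRY.pack(offset, len(chunk)))
        offset += len(chunk)


def read_partition(reduce_id, data_file, index_file):
    """Serve one reduce's byte range using a single index lookup."""
    index_file.seek(reduce_id * INDEX_ENTRY.size)
    offset, length = INDEX_ENTRY.unpack(index_file.read(INDEX_ENTRY.size))
    data_file.seek(offset)
    return data_file.read(length)


if __name__ == "__main__":
    data, index = io.BytesIO(), io.BytesIO()
    write_map_output([b"aa", b"", b"cccc"], data, index)
    print(read_partition(2, data, index))
```

With fixed-size index entries, serving a reduce costs one seek in the index and one in the data file, and both files can stay open across requests.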

The only issue I see is that when a task's output is substantially larger than its input,
we may need to spill multiple times.  In that case, we can merge the spills after they are
all complete (or during the final spill).
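
One way the post-spill merge could look, as a rough Python sketch under stated assumptions: each spill is held in memory as a pair of (data bytes, list of (offset, length) per reduce), and merging simply concatenates each reduce's segments from all spills. The function name merge_spills and this in-memory representation are hypothetical; a real implementation would stream from disk and might also need to preserve a sort order.

```python
def merge_spills(spills, num_reduces):
    """Combine several spills into one output with one index.

    spills: list of (data, index) pairs, where data is a bytes object and
    index[r] is the (offset, length) of reduce r's segment in that spill.
    Returns (merged_data, merged_index) in the same layout.
    """
    merged = bytearray()
    merged_index = []
    for r in range(num_reduces):
        start = len(merged)
        # Gather reduce r's segment from every spill so its bytes end up
        # in one contiguous range of the merged output.
        for data, index in spills:
            offset, length = index[r]
            merged += data[offset:offset + length]
        merged_index.append((start, len(merged) - start))
    return bytes(merged), merged_index
```

After the merge, each reduce again maps to exactly one byte range, so the serving side does not need to know how many spills occurred.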



