hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-331) map outputs should be written to a single output file with an index
Date Tue, 17 Oct 2006 19:03:38 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12443037 ] 
Devaraj Das commented on HADOOP-331:

A modified proposal (after Doug's comments)
   -  Keep an array (length = # of partitions) whose elements will be lists containing elements
of the form <spill#, start-key-val-offset, end-key-val-offset>*. This is obtained after
the sort and before the spill of the buffer. The spill# helps in locating the section of the
sorted partition data within a spill.
  - Merge goes over this array element by element and for each element, it looks at the contained
list elements and creates a file appending the sections one by one. Then does mergesort with
just one file as input with the output set to the final output file (opened in append mode).
 An index file is updated to contain <part#, start-offset, compressedlength>
  - At the end of the above, we have the final output file and the index file.
The negative with this approach is that it requires trips to disk for reading/writing the
sections before merge. So if we had 10 sections for a particular partition, we would have
to read 10 chunks from the sorted spills and write them to disk as a single file. The plus
is that the key format doesn't need to change at all (so minimum code change) and the number
of comparisons during merge are reduced because we merge chunks of a single partition at a

We can have the interator over keys/values as suggested by Owen and the block compression
optimization suggested by Doug. Although the iterator won't be required for the map outputs
if we adopt the approach outlined above, but will be required for the reduces, right Owen?

> map outputs should be written to a single output file with an index
> -------------------------------------------------------------------
>                 Key: HADOOP-331
>                 URL: http://issues.apache.org/jira/browse/HADOOP-331
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.3.2
>            Reporter: eric baldeschwieler
>         Assigned To: Devaraj Das
> The current strategy of writing a file per target map is consuming a lot of unused buffer
space (causing out of memory crashes) and puts a lot of burden on the FS (many opens, inodes
used, etc).  
> I propose that we write a single file containing all output and also write an index file
IDing which byte range in the file goes to each reduce.  This will remove the issue of buffer
waste, address scaling issues with number of open files and generally set us up better for
scaling.  It will also have advantages with very small inputs, since the buffer cache will
reduce the number of seeks needed and the data serving node can open a single file and just
keep it open rather than needing to do directory and open ops on every request.
> The only issue I see is that in cases where the task output is substantiallyu larger
than its input, we may need to spill multiple times.  In this case, we can do a merge after
all spills are complete (or during the final spill).

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message