hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-331) map outputs should be written to a single output file with an index
Date Tue, 17 Oct 2006 06:08:37 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12442793 ] 
            
Devaraj Das commented on HADOOP-331:
------------------------------------

I would ideally like to use as much of the existing code/tools as possible. So, here is another
proposal:
1) Generate a master index as the buffers are filled with key/value pairs. The master index
is an array indexed by partition number and each entry in the array contain values of the
form <filename, keyoffset, valoffset>*.  The filename refers to the filename where the
containing buffer was spilled. Think of this as each element in the array is an array by itself.
2) Filled buffers are spilled and NO sorting is done when a spill happens.
3) At the end, we have a master index and we go through that index by index ( partition by
partition). Since we know the filename and the key/value offsets, we can access the key/value
data with ease. We create a temporary fixed-sized in-memory buffer for each array element
and if we can accomodate all the key/value pairs for the particular array element in that
buffer, well and good; we sort that in-memory buffer and append the sorted data to a final
output file. If we cannot fit in everything into the buffer, we spill and sort the spills
after we spill all the key/value pairs.
4) While we do (3), we can trivially generate the final index file.
5) At the end of the step (3), we have a big file with sorted data (by partition) and a single
index file. 
The problems of implicit partition numbers are not there with this approach. Also, we can
do parallel sorts (like two at a time or something) of the array elements and generate multiple
files and just concatenate the sorted data at the end of the sorts to that single output file.
The negative with this approach is that we need another temp buffer (but it may be required
in the other approaches also). Think the master index file won't be very big (it will be roughly
of the order of number of records processed by a single map).
Thoughts ?

> map outputs should be written to a single output file with an index
> -------------------------------------------------------------------
>
>                 Key: HADOOP-331
>                 URL: http://issues.apache.org/jira/browse/HADOOP-331
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.3.2
>            Reporter: eric baldeschwieler
>         Assigned To: Devaraj Das
>
> The current strategy of writing a file per target map is consuming a lot of unused buffer
space (causing out of memory crashes) and puts a lot of burden on the FS (many opens, inodes
used, etc).  
> I propose that we write a single file containing all output and also write an index file
IDing which byte range in the file goes to each reduce.  This will remove the issue of buffer
waste, address scaling issues with number of open files and generally set us up better for
scaling.  It will also have advantages with very small inputs, since the buffer cache will
reduce the number of seeks needed and the data serving node can open a single file and just
keep it open rather than needing to do directory and open ops on every request.
> The only issue I see is that in cases where the task output is substantiallyu larger
than its input, we may need to spill multiple times.  In this case, we can do a merge after
all spills are complete (or during the final spill).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message