hadoop-common-dev mailing list archives

From "eric baldeschwieler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-331) map outputs should be written to a single output file with an index
Date Mon, 16 Oct 2006 03:42:36 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12442453 ] 
            
eric baldeschwieler commented on HADOOP-331:
--------------------------------------------

re: devaraj

I like the approach.  One refinement suggested below:

I don't think you want to store the partkeys inline.  That requires more code change, an
on-disk format change, and wasted bytes both on disk and over the wire.  I think you should
spill serialized key/values with a side file that maps each partition to a start offset.
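
To make that concrete, the side file could be as simple as one fixed-size record per
partition, something like this (names are illustrative only, not actual Hadoop classes):

// Hypothetical side-file record: one per partition, written next to the
// single spill file.  Offsets refer to positions in the spill file.
class IndexRecord {
  int  partition;          // which reduce this range belongs to
  long startOffset;        // byte offset of the partition's first record
  long rawLength;          // uncompressed length of the partition's data
  long compressedLength;   // bytes actually written (block-compressed)
}

A reduce fetching partition p then only needs record p to know exactly which bytes to read.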

In RAM you spill serialized key/value pairs to your buffer and also keep an array/vector
(apply the appropriate Java class here) of (partition, offset to key) entries.  You can then
quicksort the array and spill.  You want to be sure you can apply a block compressor to each
partition as it is spilled, so record the compressed lengths (Kimoon suggested this on another
thread).  This will be very efficient and simple.
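
A rough sketch of that in-memory side, assuming a plain byte buffer and an offset array
(class and method names here are made up for illustration):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: buffer serialized key/value pairs once and keep a small
// (partition, offset) entry per record, so the sort never moves the records.
class SpillBuffer {
  private final ByteArrayOutputStream data = new ByteArrayOutputStream();
  private final DataOutputStream out = new DataOutputStream(data);
  private final List<long[]> entries = new ArrayList<>();   // {partition, offset of key}

  void collect(int partition, byte[] key, byte[] value) throws IOException {
    entries.add(new long[] { partition, out.size() });
    out.writeInt(key.length);   out.write(key);
    out.writeInt(value.length); out.write(value);
  }

  // Sort the entries by (partition, key); the data buffer itself is untouched.
  // A real implementation would quicksort a primitive array in place, then walk
  // it partition by partition, block-compressing each partition as it is written
  // and recording the compressed length in the side file.
  List<long[]> sortedEntries() {
    final byte[] buf = data.toByteArray();
    entries.sort((a, b) -> {
      int c = Long.compare(a[0], b[0]);                     // partition first
      return c != 0 ? c : compareKeys(buf, (int) a[1], (int) b[1]);
    });
    return entries;
  }

  // Raw byte comparison of two serialized keys; a real implementation would
  // delegate to the job's RawComparator.
  private static int compareKeys(byte[] buf, int offA, int offB) {
    int lenA = readInt(buf, offA), lenB = readInt(buf, offB);
    for (int i = 0; i < Math.min(lenA, lenB); i++) {
      int d = (buf[offA + 4 + i] & 0xff) - (buf[offB + 4 + i] & 0xff);
      if (d != 0) return d;
    }
    return lenA - lenB;
  }

  private static int readInt(byte[] buf, int off) {
    return ((buf[off] & 0xff) << 24) | ((buf[off + 1] & 0xff) << 16)
         | ((buf[off + 2] & 0xff) << 8) | (buf[off + 3] & 0xff);
  }
}

The point is that the sort only shuffles small (partition, offset) entries; the serialized
records are written once and compressed per partition on the way to disk.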

Merging would go as you outline.  You could read one line of each sidefile and then merge
the next partition from each, so the merge would only need to consider the keys, since it
would be done per partition.

You need the sidefile to support efficient access for the reduce readers anyway.
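
The per-partition merge then looks roughly like a plain k-way merge over one segment per
spill (again, illustrative names only):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: merge the spills one partition at a time.  A "segment"
// is the already-sorted run of records one spill holds for one partition,
// located by reading the next entry of that spill's side file.
class PartitionMerger {

  static final class Record {
    final byte[] key, value;
    Record(byte[] key, byte[] value) { this.key = key; this.value = value; }
  }

  // Classic k-way merge: the heap holds at most one record per spill and only
  // compares keys -- the partition is already fixed by the caller.
  static List<Record> mergePartition(List<Iterator<Record>> segments,
                                     Comparator<byte[]> keyComparator) {
    PriorityQueue<Object[]> heap =                          // {record, its iterator}
        new PriorityQueue<>((a, b) ->
            keyComparator.compare(((Record) a[0]).key, ((Record) b[0]).key));
    for (Iterator<Record> seg : segments) {
      if (seg.hasNext()) heap.add(new Object[] { seg.next(), seg });
    }
    // A real merge would stream to the final output file instead of buffering.
    List<Record> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      Object[] top = heap.poll();
      merged.add((Record) top[0]);
      @SuppressWarnings("unchecked")
      Iterator<Record> next = (Iterator<Record>) top[1];
      if (next.hasNext()) heap.add(new Object[] { next.next(), next });
    }
    return merged;
  }
}

The outer loop just reads the next side-file entry from every spill, merges that partition,
appends it to the final output file, and records the new offset/length in the final index.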

---
re: brian's comments

I think we should keep maps simple and focus this effort on reduces, which deal with much
larger data sizes.

That said, a corner case with HUGE maps should have a reasonable outcome.  I think we need
a striped file abstraction to deal with these cases, where outputs are placed in chunks of
roughly HDFS block size on whichever disk makes the most sense.  This same approach would
probably see more use on the reduce side.

But I think this should come as a second project, rather than burdening this work with it.
Anyone want to file a bug on it?

> map outputs should be written to a single output file with an index
> -------------------------------------------------------------------
>
>                 Key: HADOOP-331
>                 URL: http://issues.apache.org/jira/browse/HADOOP-331
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.3.2
>            Reporter: eric baldeschwieler
>         Assigned To: Devaraj Das
>
> The current strategy of writing a file per target reduce consumes a lot of unused buffer
space (causing out-of-memory crashes) and puts a lot of burden on the FS (many opens, inodes
used, etc).
> I propose that we write a single file containing all output and also write an index file
identifying which byte range in the file goes to each reduce.  This will remove the issue of
buffer waste, address scaling issues with the number of open files, and generally set us up
better for scaling.  It will also have advantages with very small inputs, since the buffer
cache will reduce the number of seeks needed and the data-serving node can open a single file
and just keep it open, rather than needing to do directory and open ops on every request.
> The only issue I see is that in cases where the task output is substantially larger than
its input, we may need to spill multiple times.  In this case, we can do a merge after all
spills are complete (or during the final spill).
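
For illustration, the serving side of that proposal can be as simple as keeping the one file
open and turning a reduce id into a byte range via the index (hypothetical names, not the
actual TaskTracker code):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Illustrative only: serves one reduce's slice of a map's single output file.
// The file stays open; each request is just a seek plus a sequential copy.
class MapOutputServer {
  private final FileChannel data;      // the single map output file
  private final long[] startOffsets;   // from the index file: one per reduce, plus an end sentinel

  MapOutputServer(String dataFile, long[] startOffsets) throws IOException {
    this.data = new RandomAccessFile(dataFile, "r").getChannel();
    this.startOffsets = startOffsets;
  }

  // Copy the byte range belonging to the given reduce to the requester.
  void sendTo(int reduce, WritableByteChannel sink) throws IOException {
    long start = startOffsets[reduce];
    long length = startOffsets[reduce + 1] - start;
    long sent = 0;
    while (sent < length) {            // transferTo may copy less than asked
      sent += data.transferTo(start + sent, length - sent, sink);
    }
  }
}

With the per-reduce offsets in hand, no directory listing or per-request open is needed.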

