hadoop-mapreduce-issues mailing list archives

From "Luke Lu (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention
Date Wed, 03 Mar 2010 16:42:27 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12840722#action_12840722 ]

Luke Lu commented on MAPREDUCE-1270:
------------------------------------

Fusheng, feel free to attach the design doc if there is nothing confidential in it and Shouyan
approves :). There are plenty of people on the thread who understand Chinese. It'd help me
explain some details to Arun, now that I work next to him.

On the combiner interface, I think it'd be better to add an emitValue convenience method instead
of changing the interface, as there are quite a few legit uses.
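
One possible reading of this suggestion, as a minimal sketch: the general emit(key, value) form stays available, and emitValue is a thin wrapper that emits under the key currently being combined. Apart from the name emitValue, everything here (CombineContext, sumCombine, the string-typed keys and values) is a hypothetical illustration and not the HCE or PIPES API.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical combiner context; these names are assumptions, not HCE or PIPES code.
class CombineContext {
public:
    explicit CombineContext(std::string currentKey)
        : currentKey_(std::move(currentKey)) {}

    // General form: the combiner may still emit under any key.
    void emit(const std::string& key, const std::string& value) {
        output_.emplace_back(key, value);
    }

    // Convenience form: emit under the key currently being combined.
    void emitValue(const std::string& value) {
        emit(currentKey_, value);
    }

    const std::vector<std::pair<std::string, std::string>>& output() const {
        return output_;
    }

private:
    std::string currentKey_;
    std::vector<std::pair<std::string, std::string>> output_;
};

// Example combiner: sums the integer values of one key and emits a single total.
inline void sumCombine(CombineContext& ctx, const std::vector<int>& values) {
    assert(!values.empty());  // a combiner is invoked with at least one value per key
    long long total = 0;
    for (int v : values) total += v;
    ctx.emitValue(std::to_string(total));
}
```

Under this sketch, a combiner that only aggregates values calls emitValue and never touches the key, while one that needs to change the key can still call emit directly.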

> Hadoop C++ Extention
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment:  hadoop linux
>            Reporter: Wang Shouyan
>
>   Hadoop C++ extension is an internal project at Baidu. We started it for these reasons:
>    1  To provide a C++ API. We mostly used Streaming before, and we also tried PIPES,
> but we did not find PIPES to be more efficient than Streaming. So we think a new C++
> extension is needed for us.
>    2  Even when using PIPES or Streaming, it is hard to control the memory of the Hadoop
> map/reduce Child JVM.
>    3  It costs too much to read/write/sort TB/PB of data in Java. When using PIPES or
> Streaming, a pipe or socket is not efficient enough to carry that much data.
>    What we want to do:
>    1  We do not use the map/reduce Child JVM to do any data processing. It just prepares the
> environment, starts the C++ mapper, tells the mapper which split it should deal with, and
> reads reports from the mapper until it has finished. The mapper will read records, invoke the
> user-defined map, partition, write spills, combine, and merge into file.out. We think these
> operations can be done by C++ code.
>    2  The reducer is similar to the mapper. It is started after the sort has finished, reads
> from the sorted files, invokes the user-defined reduce, and writes to the user-defined
> record writer.
>    3  We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
>    First 1 and 2, then 3.
>    What's the difference from PIPES:
>    1  Yes, we will reuse most of the PIPES code.
>    2  And we should do it more completely: nothing changes in scheduling and management,
> but everything changes in execution.
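
The map-side flow in point 1 of "What we want to do" (invoke the user-defined map, partition, sort, combine) can be illustrated with a small in-memory sketch. This is not the HCE implementation: MapOutputBuffer, partitionOf, wordCountMap, the hash-modulo partitioner, and the summing combiner are all hypothetical names and choices, and writing spills to disk and merging them into file.out are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<std::string, int>;

// Hypothetical partitioner: hash of the key modulo the number of reducers.
inline std::size_t partitionOf(const std::string& key, std::size_t numPartitions) {
    assert(numPartitions > 0);
    return std::hash<std::string>{}(key) % numPartitions;
}

// Buffers map output in memory, one bucket per partition.
class MapOutputBuffer {
public:
    explicit MapOutputBuffer(std::size_t numPartitions)
        : partitions_(numPartitions) {
        assert(numPartitions > 0);
    }

    // Called by the user-defined map for every output record.
    void collect(const std::string& key, int value) {
        partitions_[partitionOf(key, partitions_.size())].emplace_back(key, value);
    }

    // Sorts each partition by key, then sums values of equal keys (a stand-in
    // for the combine step). Returns one sorted run per partition and clears
    // the buffer; a real spill would write these runs to disk.
    std::vector<std::vector<Record>> spill() {
        std::vector<std::vector<Record>> runs;
        for (auto& part : partitions_) {
            std::stable_sort(part.begin(), part.end(),
                             [](const Record& a, const Record& b) {
                                 return a.first < b.first;
                             });
            std::vector<Record> run;
            for (const Record& r : part) {
                if (!run.empty() && run.back().first == r.first) {
                    run.back().second += r.second;
                } else {
                    run.push_back(r);
                }
            }
            runs.push_back(std::move(run));
            part.clear();
        }
        return runs;
    }

private:
    std::vector<std::vector<Record>> partitions_;
};

// Example user-defined map: emits (word, 1) for each whitespace-separated word.
inline void wordCountMap(const std::string& line, MapOutputBuffer& out) {
    std::istringstream words(line);
    std::string w;
    while (words >> w) out.collect(w, 1);
}
```

In this sketch the Child JVM plays no part in data handling, matching the description: it would only launch the native process and read progress reports, while records stay inside the C++ buffer from map output through combine.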

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

