hadoop-mapreduce-issues mailing list archives

From "Binglin Chang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-1270) Hadoop C++ Extention
Date Sun, 14 Aug 2011 03:15:30 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084777#comment-13084777 ]

Binglin Chang commented on MAPREDUCE-1270:

Hi Arun,
HCE 2.0 is mainly focused on stability (bug fixes) and usability.
Bug fixes: HCE is not very stable right now. Although we have fixed a lot of bugs, the current codebase is
a mess :( A lot of work still needs to be done, but we currently have no time for it (other projects).
Usability: (bi)streaming over HCE is now released, along with PyHCE, since (bi)streaming and Python
are much more popular than the Java API at Baidu; C++ versions of partitioners such as KeyFieldBasedPartitioner;
Input/OutputFormats such as SequenceFile, CombineInput.., and multiple outputs; and compression
codecs such as lzma, lzo, and quicklz.
As for performance, SSE optimizations (memcmp, memchr) are used (crc32c is not added yet); we gain
another 10-20%, both in Hadoop and in upper-level applications.
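
To illustrate what a C++ partitioner in the spirit of KeyFieldBasedPartitioner might look like, here is a minimal sketch. It is not the HCE code: the names key_prefix, fnv1a and partition_for_key are hypothetical, the tab separator is an assumption, and FNV-1a is swapped in as a simple stand-in hash. The idea is only that records sharing the same leading key fields land in the same partition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: return the first `num_fields` separator-delimited
// fields of `key`. If the key has fewer fields, the whole key is returned.
inline std::string key_prefix(const std::string& key, std::size_t num_fields,
                              char sep = '\t') {
    std::size_t pos = 0;
    for (std::size_t i = 0; i < num_fields; ++i) {
        std::size_t next = key.find(sep, pos);
        if (next == std::string::npos) return key;
        pos = next + 1;
    }
    // `pos` now points just past the separator that ends the last wanted field.
    return key.substr(0, pos == 0 ? 0 : pos - 1);
}

// 32-bit FNV-1a, used here only as a simple, deterministic stand-in hash.
inline std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Hypothetical partitioner: hash only the leading key fields, then take the
// result modulo the number of partitions (reducers).
inline std::uint32_t partition_for_key(const std::string& key,
                                       std::size_t num_fields,
                                       std::uint32_t num_partitions) {
    assert(num_partitions > 0);
    return fnv1a(key_prefix(key, num_fields)) % num_partitions;
}
```

With num_fields = 1, the keys "user1\tx" and "user1\ty" go to the same partition because only "user1" is hashed.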

About MR-v2:
We are watching your progress closely and have already read your design doc and some of the code.
We look forward to further discussion on this very interesting topic.

> Hadoop C++ Extention
> --------------------
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment:  hadoop linux
>            Reporter: Wang Shouyan
>         Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE Performance Report.pdf,
HCE Tutorial.pdf, Overall Design of Hadoop C++ Extension.doc
>   Hadoop C++ extension is an internal project at Baidu. We started it for these reasons:
>    1  To provide a C++ API. We mostly used Streaming before, and we also tried PIPES,
but we did not find PIPES to be more efficient than Streaming. So we 
> think a new C++ extension is needed for us.
>    2  Even when using PIPES or Streaming, it is hard to control the memory of the Hadoop map/reduce
Child JVM.
>    3  It costs too much to read/write/sort TB/PB of data in Java. When using PIPES or Streaming,
a pipe or socket is not efficient enough to carry such huge amounts of data.
>    What we want to do: 
>    1 We do not use the map/reduce Child JVM to do any data processing; it just prepares the
environment, starts the C++ mapper, tells the mapper which split it should deal with, and reads reports
from the mapper until it has finished. The mapper will read records, invoke the user-defined map, do the
partitioning, write spills, combine, and merge into file.out. We think these operations can be
done in C++ code.
>    2 The reducer is similar to the mapper: it is started after the sort has finished, reads from the sorted
files, invokes the user-defined reduce, and writes to the user-defined record writer.
>    3 We also intend to rewrite shuffle and sort in C++, for efficiency and memory control.
>    First 1 and 2, then 3.  
>    What's the difference from PIPES:
>    1 Yes, we will reuse most of the PIPES code.
>    2 And we should do it more completely: nothing changes in scheduling and management,
but everything changes in execution.
> Now you can get a test version of HCE from this link http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
> This is a full package with all Hadoop source code.
> Following the document "HCE InstallMenu.pdf" in the attachments, you can build and deploy it
on your cluster.
> The attachment "HCE Tutorial.pdf" will lead you through writing your first HCE program and gives other
specifications of the interface.
> The attachment "HCE Performance Report.pdf" gives a performance report of HCE compared to
Java MapRed and Pipes.
> Any comments are welcome.
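
The mapper-side flow described in item 1 of "What we want to do" (read records, invoke the user-defined map, partition, combine) can be sketched roughly as below. This is only an in-memory illustration under stated assumptions, not the HCE implementation: run_map_task, word_count_map and the type aliases are hypothetical names, std::hash stands in for the real partitioner, a std::map per partition stands in for sort plus a summing combiner, and spilling to disk and merging into file.out are omitted entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for this sketch only.
using Emit = std::function<void(const std::string& key, long value)>;
using MapFn = std::function<void(const std::string& record, const Emit& emit)>;

// Run a user-defined map over every record of a split and return one
// sorted, combined run per partition.
inline std::vector<std::map<std::string, long>> run_map_task(
    const std::vector<std::string>& split, const MapFn& user_map,
    std::size_t num_partitions) {
    assert(num_partitions > 0);
    std::vector<std::map<std::string, long>> partitions(num_partitions);

    // Each emitted pair is routed to a partition by key hash. std::map keeps
    // keys sorted, and += plays the role of a summing combiner.
    Emit emit = [&](const std::string& key, long value) {
        std::size_t p = std::hash<std::string>{}(key) % num_partitions;
        partitions[p][key] += value;
    };

    for (const std::string& record : split) {
        user_map(record, emit);
    }
    return partitions;
}

// Example user-defined map: emit (word, 1) for each whitespace-separated word.
inline void word_count_map(const std::string& record, const Emit& emit) {
    std::istringstream in(record);
    std::string word;
    while (in >> word) {
        emit(word, 1);
    }
}
```

Calling run_map_task({"a b a", "b c"}, word_count_map, 1) yields a single partition holding the counts a=2, b=2, c=1. A real mapper would bound the buffer size and spill sorted runs to disk, which is where the memory control mentioned in the issue comes in.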

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

