hadoop-common-user mailing list archives

From Ricky Ho <...@adobe.com>
Subject RE: How many people is using Hadoop Streaming ?
Date Fri, 03 Apr 2009 17:35:52 GMT
Owen, thanks for the elaboration; the data point is very useful.

On your point ...
====================================================
In java you get
          key1, (value1, value2, ...)
          key2, (value3, ...)
in streaming you get
          key1 value1
          key1 value2
          key2 value3
and your application needs to detect the key changes.
=====================================================

I assume that the keys are still sorted, right? That means I will get all the "key1 valueX"
entries before getting any of the "key2 valueY" entries, and key2 always sorts after key1.

Is this correct?
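
To make the question concrete, here is a minimal sketch of how a streaming reducer might detect key changes. It assumes exactly what is being asked above, namely that all lines for one key arrive contiguously, and it also assumes tab-separated "key<TAB>value" lines. The names parse, group_by_key and run are hypothetical, not part of Hadoop.

```python
from itertools import groupby


def parse(lines, sep="\t"):
    """Split each input line into a (key, value) pair at the first separator."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition(sep)
        yield key, value


def group_by_key(lines, sep="\t"):
    """Yield (key, [values]) for each run of consecutive lines sharing a key.

    Assumption: lines for the same key are adjacent. groupby only merges
    neighbours, so unsorted input would produce the same key more than once.
    """
    for key, pairs in groupby(parse(lines, sep), key=lambda kv: kv[0]):
        yield key, [value for _, value in pairs]


def run(lines, out):
    """Example reducer body: write each key with the number of values seen."""
    for key, values in group_by_key(lines):
        out.write(f"{key}\t{len(values)}\n")
```

A reducer script would call run(sys.stdin, sys.stdout) after importing sys. The grouping rebuilds the "key1, (value1, value2, ...)" shape by hand, which is the extra work the quoted point describes.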

Rgds,
Ricky


-----Original Message-----
From: Owen O'Malley [mailto:omalley@apache.org] 
Sent: Friday, April 03, 2009 8:59 AM
To: core-user@hadoop.apache.org
Subject: Re: How many people is using Hadoop Streaming ?


On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote:

> Has anyone benchmarked the performance difference of using Hadoop?
>  1) Java vs C++
>  2) Java vs Streaming

Yes, a while ago. When I tested it using sort, Java and C++ were  
roughly equal and streaming was 10-20% slower. Most of the cost with  
streaming came from the stringification.

>  1) I can pick a language that offers a different programming
> paradigm (e.g. I may choose a functional language, or logic
> programming if it suits the problem better). In fact, I can even
> choose Erlang for the map() and Prolog for the reduce(). Mixing and
> matching lets me optimize further.
>  2) I can pick the language that I am familiar with, or one that I
> like.
>  3) It is easy to switch to another language in a fine-grained,
> incremental way if I choose to do so in the future.

Additionally, the interface to streaming is very stable. *smile* It  
also supports legacy applications well.

The downsides are that:
   1. The interface is very thin and has minimal functionality.
   2. Streaming combiners don't work very well. Many streaming
       applications buffer in the map and run the combiner internally.
   3. Streaming doesn't group the values in the reducer. In Java or C++,
       you get:
          key1, (value1, value2, ...)
          key2, (value3, ...)
       in streaming you get
          key1 value1
          key1 value2
          key2 value3
       and your application needs to detect the key changes.
   4. Binary data support has only recently been added to streaming.
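
As an illustration of point 2, the following is a minimal sketch of a mapper that buffers its output and combines it internally before emitting. Word counting is used only as an assumed example workload; the names map_with_combining and flush and the max_keys threshold are hypothetical and not part of streaming itself.

```python
from collections import defaultdict


def flush(counts):
    """Turn the buffered counts into 'word<TAB>count' lines and empty the buffer."""
    out = [f"{word}\t{n}" for word, n in counts.items()]
    counts.clear()
    return out


def map_with_combining(lines, max_keys=100_000):
    """Count words in memory and emit partial sums instead of one line per word.

    max_keys bounds the buffer size (an assumed tuning knob): once that many
    distinct words are held, the partial counts are emitted and the buffer reset.
    """
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
        if len(counts) >= max_keys:
            yield from flush(counts)
    # Emit whatever is still buffered at end of input.
    yield from flush(counts)
```

A mapper script would print each line yielded by map_with_combining(sys.stdin). Because a word can be flushed more than once, the reducer still has to sum the partial counts per key.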

> Am I missing something here? Or are the majority of Hadoop
> applications written in Hadoop Streaming?

On Yahoo's research clusters, typically 1/3 of the applications are
streaming, 1/3 Pig, and 1/3 Java.

-- Owen
