hadoop-common-user mailing list archives

From Dan Milstein <dmilst...@hubspot.com>
Subject Re: Hadoop & Python
Date Thu, 21 May 2009 12:19:25 GMT
One thing about the | sort | sh combiner.sh approach: you do have to  
be careful about memory if you're doing that -- if a mapper instance  
sees a large number of rows, you'll be asking sort to sort *all* of  
those before passing them to the combiner.  Hadoop itself only hands  
off some bounded number of output keys at a time to the combiner,  
which is much safer for large data sets.

In dumbo itself, Klaas added "combine a chunk at a time" to address
this problem.
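
For anyone who wants the same idea in a plain streaming job, here is a rough sketch of chunk-at-a-time combining. It is not dumbo's actual code: it assumes tab-separated "key<TAB>integer" mapper output and a combine step that just sums, and the names (MAX_KEYS, combine_in_chunks, parse, main) and the 10000-key bound are made up for illustration.

```python
import sys
from collections import defaultdict

MAX_KEYS = 10000  # hypothetical bound on distinct keys held in memory at once


def parse(lines):
    """Turn 'key<TAB>int' lines into (key, int) pairs; lines without a value are skipped."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if value:
            yield key, int(value)


def combine_in_chunks(pairs, max_keys=MAX_KEYS):
    """Sum values per key, flushing the buffer once it holds max_keys distinct keys.

    Memory stays bounded by max_keys, but the same key can be emitted more
    than once, so the reducer still has to add up the partial sums.
    """
    buffer = defaultdict(int)
    for key, value in pairs:
        buffer[key] += value
        if len(buffer) >= max_keys:
            for item in sorted(buffer.items()):
                yield item
            buffer.clear()
    # Flush whatever is left after the input ends.
    for item in sorted(buffer.items()):
        yield item


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read mapper output on stdin, write partially combined output to stdout."""
    for key, total in combine_in_chunks(parse(stdin)):
        stdout.write("%s\t%d\n" % (key, total))
```

You would call main() from a small script piped after the mapper in place of "| sort | sh combiner.sh". The trade-off is that a smaller bound means less memory but more duplicate keys sent on to the reducer.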

(And yes, overall, getting combiners fully supported in streaming is
awesome.)

-D

On May 19, 2009, at 5:04 PM, Peter Skomoroch wrote:

> Whoops, should have googled it first.  Looks like this is now fixed in
> trunk, HADOOP-4842.  For people stuck using 18.3, a workaround appears to be
> adding something like "| sort | sh combiner.sh" to the call of the mapper
> script (via Klaas Bosteels)
>
> Would be great to get this patched into distributions like EMR and Cloudera
>
> On Tue, May 19, 2009 at 4:59 PM, Peter Skomoroch
> <peter.skomoroch@gmail.com> wrote:
>
>> One area I'm curious about is the requirement that any combiners in
>> Streaming jobs be java classes.  Are there any plans to change this in the
>> future?  Prototyping streaming jobs in Python is great, and the ability to
>> use a Python combiner would help performance a lot without needing to move
>> to Java.
>>
>> On Tue, May 19, 2009 at 4:30 PM, Amr Awadallah <aaa@cloudera.com> wrote:
>>
>>> S d,
>>>
>>> It is totally fine to use Python streaming if it does the job you are
>>> after, there will be a slight performance hit, but that is noise assuming
>>> your cluster is a small one. If you are operating a large cluster
>>> continuously, then once your logic is stabilized using Python it might make
>>> sense to convert/operationalize some jobs to Java (or C pipes) to improve
>>> performance for purpose of finishing quicker or reducing number of servers
>>> needed.
>>>
>>> You should also take a look at PIG and Hive, they are both higher level
>>> languages and very easy to learn:
>>>
>>> http://www.cloudera.com/hadoop-training-pig-introduction
>>>
>>> http://www.cloudera.com/hadoop-training-hive-introduction
>>>
>>> -- amr
>>>
>>>
>>> s d wrote:
>>>
>>>> Thanks.
>>>> So in the overall scheme of things, what is the general feeling about using
>>>> python for this? I like the ease of deploying and reading python compared
>>>> with Java but want to make sure using python over hadoop is scalable & is
>>>> standard practice and not something done only for prototyping and small
>>>> scale tests.
>>>>
>>>>
>>>> On Tue, May 19, 2009 at 9:48 AM, Alex Loddengaard <alex@cloudera.com> wrote:
>>>>
>>>>> Streaming is slightly slower than native Java jobs.  Otherwise Python works
>>>>> great in streaming.
>>>>>
>>>>> Alex
>>>>>
>>>>> On Tue, May 19, 2009 at 8:36 AM, s d <s.d.sauron@gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi,
>>>>>> How robust is using hadoop with python over the streaming protocol? Any
>>>>>> disadvantages (performance? flexibility?)?  It just strikes me that python
>>>>>> is so much more convenient when it comes to deploying and crunching text
>>>>>> files.
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Peter N. Skomoroch
>> 617.285.8348
>> http://www.datawrangling.com
>> http://delicious.com/pskomoroch
>> http://twitter.com/peteskomoroch
>>

