hadoop-common-dev mailing list archives

From Arun C Murthy <...@yahoo-inc.com>
Subject Re: Integration with SGE
Date Wed, 18 Feb 2009 18:57:19 GMT

On Feb 18, 2009, at 10:37 AM, Amin Astaneh wrote:

> Lukáš-
>
> Well, we have a graduate student who is using our facilities for a Master's thesis in Map/Reduce. You guys are generating topics in computer science research.
>
> What do we need to do in order to get our documentation on the Hadoop pages?

You have a couple of options:
a) Put it on the Hadoop wiki (http://wiki.apache.org/hadoop/); for example, look at the existing pages on using Hadoop on EC2/S3.
b) Open a jira (Create New Issue at https://issues.apache.org/jira/browse/HADOOP) and attach forrest-based documentation.


> -Amin
>
>> Thanks guys, it is good to hear that Hadoop is spreading... :-)
>>
>> Regards,
>> Lukas
>>
>> On Wed, Feb 18, 2009 at 5:24 PM, Steve Loughran <stevel@apache.org> wrote:
>>> Amin Astaneh wrote:
>>>> Lukáš-
>>>>> Hi Amin,
>>>>> I am not familiar with SGE; do you think you could tell me what you got from this combination? What is the benefit of running Hadoop on SGE?
>>>> Sun Grid Engine is a distributed resource management platform for supercomputing centers. We use it to allocate resources to a supercomputing task, such as requesting 32 processors to run a particular simulation. This mechanism is analogous to the scheduler on a multi-user OS.
>>>>
>>>> What I was able to accomplish was to turn Hadoop into an as-needed service. When you submit a job request to run Hadoop as the documentation describes, a Hadoop cluster of arbitrary size is instantiated, depending on how many nodes were requested, by generating a cluster configuration specific to that job request. This allows the Hadoop cluster to be deployed within the context of Gridengine, and to coexist with other running simulations on the cluster.
>>>>
>>>> To the researcher or user needing to run a mapreduce code, all they need to worry about is telling Hadoop to execute it and deciding how many machines should be dedicated to the task. This makes Hadoop very accessible, since people don't need to worry about configuring a cluster; SGE and its helper scripts do it for them.
>>>>
>>>> As Steve Loughran accurately commented, as of now we can only run one set of Hadoop slave processes per machine, due to the network binding issue. That problem is mitigated by configuring SGE to spread the slaves one per machine automatically to avoid failures.
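[The per-job configuration generation Amin describes could be sketched roughly as below. This is a minimal illustration, not his actual helper scripts; the file layout and port numbers are assumptions. Inside an SGE parallel-environment job, SGE sets $PE_HOSTFILE to a file whose lines look like "hostname slots queue processor-range".]

```shell
# Sketch: build a per-job Hadoop configuration from the hosts SGE allocated.
make_job_conf() {
    hostfile=$1; confdir=$2
    mkdir -p "$confdir"
    # First allocated host doubles as the NameNode/JobTracker master.
    master=$(awk 'NR==1 {print $1}' "$hostfile")
    echo "$master" > "$confdir/masters"
    # Every allocated host runs a DataNode/TaskTracker slave.
    awk '{print $1}' "$hostfile" > "$confdir/slaves"
    # Point the filesystem and job tracker at the master
    # (hadoop-site.xml style, as in 2009-era Hadoop; ports 9000/9001 assumed).
    cat > "$confdir/hadoop-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property><name>fs.default.name</name><value>hdfs://$master:9000</value></property>
  <property><name>mapred.job.tracker</name><value>$master:9001</value></property>
</configuration>
EOF
}

# The job script would then do something along the lines of:
#   make_job_conf "$PE_HOSTFILE" "$TMPDIR/conf"
#   hadoop-daemon.sh --config "$TMPDIR/conf" start namenode   # etc. per daemon
```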
>>> Only the Namenode and JobTracker need hard-coded/well-known port numbers; the rest could all be done dynamically.
>>>
>>> One thing SGE does offer over Xen-hosted images is better performance than virtual machines, for both CPU and storage, as virtualised disk performance can be awful, and even on the latest x86 parts there is a measurable hit from VM overheads.
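[On Steve's point about dynamic ports: in Hadoop of that era, several slave-side address settings accept port 0, which asks the daemon to bind any free port, so only the master addresses need to be well-known. A hedged hadoop-site.xml fragment; exact property names and defaults may vary by Hadoop version:]

```xml
<!-- Fragment: let slave daemons bind ephemeral ports (port 0 = any free port).
     Only fs.default.name and mapred.job.tracker need fixed ports. -->
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:0</value>
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>0.0.0.0:0</value>
</property>
<property>
  <name>mapred.task.tracker.report.address</name>
  <value>127.0.0.1:0</value>
</property>
```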
