From: Amin Astaneh
Date: Wed, 18 Feb 2009 13:37:54 -0500
To: core-dev@hadoop.apache.org
Subject: Re: Integration with SGE
Message-ID: <499C5582.7070409@rc.usf.edu>
Lukáš-

Well, we have a graduate student who is using our facilities for a
Master's thesis in Map/Reduce. You guys are generating topics in
computer science research.

What do we need to do in order to get our documentation on the Hadoop
pages?

-Amin

> Thanks guys, it is good to hear that Hadoop is spreading... :-)
>
> Regards,
> Lukas
>
> On Wed, Feb 18, 2009 at 5:24 PM, Steve Loughran wrote:
>
>> Amin Astaneh wrote:
>>
>>> Lukáš-
>>>
>>>> Hi Amin,
>>>> I am not familiar with SGE; could you tell me what you got from this
>>>> combination? What is the benefit of running Hadoop on SGE?
>>>
>>> Sun Grid Engine is a distributed resource management platform for
>>> supercomputing centers. We use it to allocate resources to a
>>> supercomputing task, such as requesting 32 processors to run a
>>> particular simulation. This mechanism is analogous to the scheduler
>>> on a multi-user OS. What I was able to accomplish was to turn Hadoop
>>> into an as-needed service. When you submit a job request to run
>>> Hadoop as the documentation describes, a Hadoop cluster of arbitrary
>>> size is instantiated, depending on how many nodes were requested, by
>>> generating a cluster configuration specific to that job request. This
>>> allows the Hadoop cluster to be deployed within the context of
>>> Gridengine, and to coexist with other running simulations on the
>>> cluster.
>>> To the researcher or user needing to run a MapReduce code, all they
>>> need to worry about is telling Hadoop to execute it and deciding how
>>> many machines should be dedicated to the task. This makes Hadoop very
>>> accessible, since people don't need to worry about configuring a
>>> cluster; SGE and its helper scripts do it for them.
>>>
>>> As Steve Loughran accurately commented, as of now we can only run one
>>> set of Hadoop slave processes per machine, due to the network binding
>>> issue. That problem is mitigated by configuring SGE to spread the
>>> slaves one per machine automatically, to avoid failures.
>>
>> Only the Namenode and JobTracker need hard-coded/well-known port
>> numbers; the rest could all be done dynamically.
>>
>> One thing SGE does offer over Xen-hosted images is better performance
>> than virtual machines, for both CPU and storage: virtualised disk
>> performance can be awful, and even on the latest x86 parts there is a
>> measurable hit from VM overheads.
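[Steve's point about ports can be illustrated with a configuration sketch. In Hadoop configuration of that era, setting a service address's port to 0 asks the OS for any free port, so only the two master addresses need to be fixed. The exact property names vary by Hadoop version, and the hostname/ports below are placeholders.]

```xml
<!-- Sketch of a per-job hadoop-site.xml fragment (0.18/0.19-era
     property names; placeholder host and ports). Only the NameNode
     and JobTracker addresses are hard-coded; slave-side services
     use port 0 to bind to any free port. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master.example.org:9000</value> <!-- fixed NameNode port -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master.example.org:9001</value> <!-- fixed JobTracker port -->
  </property>
  <property>
    <name>dfs.datanode.address</name>
    <value>0.0.0.0:0</value> <!-- DataNode: pick any free port -->
  </property>
  <property>
    <name>dfs.datanode.ipc.address</name>
    <value>0.0.0.0:0</value>
  </property>
</configuration>
```

[With something like this, multiple slave processes could in principle share a machine, which is the binding issue Amin mentions working around by spreading slaves one per host.]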
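[The per-job cluster instantiation Amin describes could look roughly like the sketch below. Under SGE, a parallel job receives its granted hosts in the file named by `$PE_HOSTFILE`; a helper script can turn that list into a Hadoop slaves file and a job-specific configuration. The script and file names here are hypothetical illustrations, not the actual USF helper scripts.]

```shell
#!/bin/sh
# Hypothetical sketch of the per-job Hadoop setup under SGE.
# An SGE $PE_HOSTFILE lists granted hosts, one per line, in the form:
#   "hostname slots queue processor-range"

# Extract the first column (hostnames) to build a Hadoop slaves file:
make_slaves_file() {
    awk '{ print $1 }' "$1"
}

# Simulate an SGE allocation for demonstration (a real job would
# read the file named by $PE_HOSTFILE instead):
cat > pe_hostfile.example <<'EOF'
node01 1 all.q@node01 UNDEFINED
node02 1 all.q@node02 UNDEFINED
EOF

make_slaves_file pe_hostfile.example > slaves
cat slaves

# A real job script would then write a job-specific configuration
# directory, point HADOOP_CONF_DIR at it, and start the daemons.
```

[In a real setup the job script would be submitted with something like `qsub -pe <hadoop_pe> 32`, so the number of slaves tracks the number of slots the researcher requested.]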