hadoop-common-user mailing list archives

From Adarsh Sharma <adarsh.sha...@orkash.com>
Subject Re: Deficiency in Hadoop
Date Thu, 11 Nov 2010 12:48:35 GMT
Steve Loughran wrote:
> On 11/11/10 11:02, Adarsh Sharma wrote:
>> Dear all,
>>
>> Does anyone have experience with integrating Hadoop with SGE
>> (Sun Grid Engine)? It is open-source too (sge-6.2u5).
>> Does SGE really overcome some of the deficiencies of Hadoop?
>> According to an article:
>
> That'll be DanT's posting
> http://blogs.sun.com/templedf/entry/leading_the_herd
>
>>
>> Instead, to set the stage, let's talk about what Hadoop doesn't do so
>> well. I currently see two important deficiencies in Hadoop: it doesn't
>> play well with others, and it has no real accounting framework. Pretty
>> much every customer I've seen running Hadoop does it on a dedicated
>> cluster. Why? Because the tasktrackers assume they own the machines on
>> which they run. If there's anything on the cluster other than Hadoop,
>> it's in direct competition with Hadoop. That wouldn't be such a big deal
>> if Hadoop clusters didn't tend to be so huge. Folks are dedicating
>> hundreds, thousands, or even tens of thousands of machines to their
>> Hadoop applications. That's a lot of hardware to be walled off for a
>> single purpose. Are those machines really being used? You may not be
>> able to tell. You can monitor state in the moment, and you can grep
>> through log files to find out about past usage (Gah!), but there's no
>> historical accounting capability there.
>>
>> So I want to know whether it is worthwhile to use SGE with Hadoop in
>> a production cluster or not.
>> Please share your views.
>>
>
> A permanently allocated set of machines gives you permanent HDFS 
> storage at the cost of SATA HDDs. Once you go to any on-demand 
> infrastructure you need some persistent store, and it tends to lack 
> locality and have a higher cost/GB, usually because it is SAN-based.
>
> What on-demand stuff is good for is sharing physical machines,
> because unless you can keep the CPU+RAM in your cluster busy, that's
> wasted CAPEX/OPEX budget.
>
> One thing that's been discussed is to have a physical Hadoop cluster,
> but have the TTs' (TaskTrackers') capacity reporting work well with
> other schedulers, via some plugin point:
>
> https://issues.apache.org/jira/browse/MAPREDUCE-1603
>
> This would let your cluster also accept work from other job execution
> frameworks and, when busy with that work, report fewer slots to the JT
> (JobTracker), while still serving up data to the rest of the Hadoop
> workers.
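>
> As a rough sketch, the plugin point might look something like the code
> below. Every name in it is made up -- nothing like this exists in the
> codebase yet -- but it shows the idea: the TT consults a pluggable
> policy before each heartbeat and reports only what the policy allows.
>
>   import java.lang.management.ManagementFactory;
>
>   // Hypothetical plugin interface the TT would consult per heartbeat.
>   interface SlotCapacityPolicy {
>     int availableMapSlots(int configuredMapSlots);
>     int availableReduceSlots(int configuredReduceSlots);
>   }
>
>   // Example policy: back off the slot count as the host's load
>   // average rises, so work placed on the node by another scheduler
>   // (SGE, say) squeezes Hadoop tasks out instead of competing.
>   public class LoadAwareCapacityPolicy implements SlotCapacityPolicy {
>     private final int cores = Runtime.getRuntime().availableProcessors();
>
>     private int scale(int configured) {
>       double load = ManagementFactory.getOperatingSystemMXBean()
>           .getSystemLoadAverage();       // -1.0 where unsupported
>       if (load < 0) return configured;   // no data: report everything
>       double free = Math.max(0.0, 1.0 - load / cores);
>       return (int) Math.round(configured * free);
>     }
>
>     public int availableMapSlots(int configured) {
>       return scale(configured);
>     }
>     public int availableReduceSlots(int configured) {
>       return scale(configured);
>     }
>   }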
>
> Benefits:
>  -cost of storage is at HDFS rates
>  -performance of a normal Hadoop cluster
>  -under-utilised Hadoop cluster time can be used by other work
> schedulers, ones that don't need access to the Hadoop storage.
>
> Costs:
>  -HDFS security: can you lock it down?
>  -your other workloads had better not expect SAN or a low-latency
> interconnect like InfiniBand, unless you add those to the cluster too,
> which bumps up costs.
>
> Nobody has implemented this yet, so volunteers to take up their IDE 
> against Hadoop 0.23 would be welcome. And yes, I do mean 0.23, that's 
> the schedule that would work.
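>
> Wiring a policy in would be the usual Hadoop reflection dance. The
> config key below is invented, but Configuration.getClass() and
> ReflectionUtils.newInstance() are the standard way plugins get loaded:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.util.ReflectionUtils;
>
>   // Somewhere in the TT's initialisation (sketch only; conf,
>   // maxMapSlots and maxReduceSlots come from the surrounding TT code):
>   Class<? extends SlotCapacityPolicy> cls = conf.getClass(
>       "mapred.tasktracker.capacity.policy.class",  // made-up key
>       LoadAwareCapacityPolicy.class,               // default
>       SlotCapacityPolicy.class);
>   SlotCapacityPolicy policy = ReflectionUtils.newInstance(cls, conf);
>
>   // ...then, when building each heartbeat:
>   int mapSlots = policy.availableMapSlots(maxMapSlots);
>   int reduceSlots = policy.availableReduceSlots(maxReduceSlots);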
>
> -Steve
Thanks a lot, Steve!
This is the way to clear up such doubts.

Best Regards
-Adarsh

