hadoop-common-user mailing list archives

From zhu hui <chinazhuhu...@gmail.com>
Subject Re: hadoop performance with very small cluster
Date Fri, 22 May 2009 05:54:25 GMT
Hi Miles, Brian,

Thank you both very much for your kind replies and analysis. They helped me
grasp some important points of the problem.

Best Wishes.

Eric. Syu

On Thu, May 21, 2009 at 9:12 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:

>
> On May 21, 2009, at 2:30 AM, Miles Osborne wrote:
>
>> if you mean "hadoop does not give a speed-up compared with a
>> sequential version" then this is because of overhead associated with
>> running the framework:  your job will need to be scheduled, JVMs
>> instantiated, data copied, data sorted etc etc.
>>
>
> Eric,
>
> It depends on your problem.  If you have a java program that's got lots of
> CPU per key and already is map-reduce-like, you'll probably see pretty good
> efficiency.  If you have a highly optimized assembler program that runs in
> seconds, you'll probably see poor "efficiency" (however you might be
> defining that).
>
> Let's say you have N machines and the program takes L seconds on 1 machine.
> Assume that the overhead is 10 seconds for framework initialization
> (perhaps conservative?).  Then, the total runtime is L/N + 10; the speedup
> is L/(L/N+10).
>
> Now, plug in estimates for your cluster.  If L->0 or N->infinity, then the
> dominant term in the runtime is the 10 seconds for initialization.  So,
> if N=5 and your original problem took 1 minute, your maximum speedup is
> about 2.7 (60/22).  If your initial problem took 1 hour, the maximum
> speedup is about 4.9 (3600/730), close to the ideal of 5.  (Look up
> Amdahl's law, that's all I'm applying...)
>
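Brian's estimate can be checked with a few lines of Python. This is only a
sketch of the arithmetic in his message: the function name `speedup` and the
parameter names are made up for illustration, and the 10-second overhead is
his assumed figure, not a measured Hadoop number.

```python
def speedup(seq_seconds, machines, overhead_seconds=10.0):
    """Estimated speedup L / (L/N + overhead), following the quoted model.

    seq_seconds      -- L, runtime on a single machine
    machines         -- N, number of machines in the cluster
    overhead_seconds -- fixed framework start-up cost (assumed, not measured)
    """
    parallel_runtime = seq_seconds / machines + overhead_seconds
    return seq_seconds / parallel_runtime


if __name__ == "__main__":
    # The two cases from the message: a 1-minute job and a 1-hour job on N=5.
    for label, seq in [("1 minute", 60.0), ("1 hour", 3600.0)]:
        print(f"{label}: {speedup(seq, 5):.2f}x")
    # 1 minute: 2.73x
    # 1 hour: 4.93x
```

Under this model the speedup never reaches N, and the shorter the job, the
more the fixed overhead eats into it.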
> So, like the answer to most general questions, the answer is "it depends".
> For the most part, it depends on whether your problem can be parallelized
> and on how its runtime compares with the Hadoop overhead.  Even if
> Hadoop might not provide a huge speedup currently, I'd add to Miles'
> comment: not only would the solution be easier to maintain, but it would
> also be easier to grow when you decide you need, say, 100 machines to
> process your problem.
>
> Brian
>
>
>
>>
>> if your jobs can be parallelised and you have enough machines (your
>> cluster is large enough) then the ability to use more machines should
>> compensate for the framework overhead.
>>
>> even if your sequential / hacked version running on a small cluster
>> beats the hadoop version, in my mind a major advantage of Hadoop (and
>> this is something that people tend to forget) is that your Hadoop
>> version almost certainly will be simpler and easier to maintain.
>>
>> Miles
>>
>> 2009/5/21 zhu hui <chinazhuhui04@gmail.com>:
>>
>>> Hello, everybody.
>>>
>>> I am new to Hadoop, and I have heard from others that Hadoop is not
>>> efficient when the cluster is very small, for example 6 machines.
>>>
>>> But I cannot find the reasons, or any materials that I could use as
>>> evidence.
>>>
>>> I would be very grateful if anybody could share some materials or
>>> ideas.
>>>
>>> Best Wishes.
>>>
>>> Eric.Syu
>>>
>>> --
>>> Nothing Impossible
>>>
>>>
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>
>


-- 
Nothing Impossible
