hadoop-common-user mailing list archives

From "tim robertson" <timrobertson...@gmail.com>
Subject Re: Hadoop - is it good for me and performance question
Date Tue, 01 Jul 2008 08:30:45 GMT
MapReduce on Hadoop is for processing very large amounts of data; otherwise the
overhead of the framework (job scheduling, failover etc.) does not justify it.
If you are processing 10-100 MB/min, that is 14-140 GB a day, which probably
justifies its use, I would say.
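To spell out the arithmetic above (assuming the "10-100M / min" figure means megabytes per minute):

```python
# Rough daily-volume arithmetic for the rates quoted above,
# assuming "10-100M / min" means megabytes per minute.
MINUTES_PER_DAY = 60 * 24  # 1440

low_gb_per_day = 10 * MINUTES_PER_DAY / 1000    # 10 MB/min  -> 14.4 GB/day
high_gb_per_day = 100 * MINUTES_PER_DAY / 1000  # 100 MB/min -> 144 GB/day

print(low_gb_per_day, high_gb_per_day)  # 14.4 144.0
```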

You can't get a performance estimate on a pseudo-distributed cluster on one
machine with small amounts of data - it is just not what Hadoop is designed for.

I have recently gone through what you are doing, and then went to EC2 to do my
first real test at the weekend.
Have you considered a test run on EC2 with a 140 GB file?  It takes about a day
from starting to getting a job running unless you already know EC2, as there is
a fair amount to read and set up, and it will cost you around US$5 in total.

I blogged my experience here, which should help you avoid a couple of pitfalls:
http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html

I have subsequently found I ran only 1 reducer, and it was the reduce phase that
took 50% of the time - I should have run more like 10 reducers for the job I
was doing...
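For what it's worth, the reducer count is just a job setting: in the Java API of that era it is conf.setNumReduceTasks(10) on the JobConf, or -D mapred.reduce.tasks=10 on the command line. Hadoop's default HashPartitioner then spreads map output keys over the reducers by hashing. A minimal Python sketch of that partitioning idea (the sample keys are made up for illustration):

```python
# Sketch of how Hadoop's default HashPartitioner assigns map output
# keys to reducers: partition = (hash & Integer.MAX_VALUE) % numReduceTasks.
# With 1 reducer every key lands in the same partition; with 10 reducers
# the reduce work is spread roughly tenfold.

def partition(key, num_reducers):
    # Python's built-in hash() stands in for Java's key.hashCode() here.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

keys = ["apple", "banana", "cherry", "date", "elderberry", "fig"]

one_reducer = {partition(k, 1) for k in keys}    # every key -> partition 0
ten_reducers = {partition(k, 10) for k in keys}  # keys spread over up to 10 partitions

print(one_reducer)  # {0}
```

With a single reducer, every key funnels through one task, which is why the reduce phase dominated my run time.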

Cheers

Tim


On Tue, Jul 1, 2008 at 10:14 AM, Yair Gottdenker <yair@cotendo.com> wrote:

> Thanks for your reply Haijun,
>
> Do you know what makes Hadoop run so slow? I have been trying to figure
> it out myself, but I can't imagine anything so complicated that it
> justifies Hadoop's performance and latency.
>
>
>
> -----Original Message-----
> From: Haijun Cao [mailto:haijun@kindsight.net]
> Sent: Monday, June 30, 2008 9:33 PM
> To: core-user@hadoop.apache.org
> Subject: RE: Hadoop - is it good for me and performance question
>
>
> Not sure if this will answer your question, but here is a similar thread
> regarding Hadoop performance:
>
> http://www.mail-archive.com/core-user@hadoop.apache.org/msg02878.html
>
> Hadoop is good for log processing if you have a lot of logs to process
> and you don't need the result in real time (e.g. you can accumulate one
> day's logs and process them in one batch, latency == 1 day). In other
> words, it shines at large-data-set batch (long-latency) processing.  It
> is good at scalability (scaling out), not at increasing single
> core/machine performance. If your data fits in one process, then using a
> distributed framework will probably slow it down.
>
> Haijun
>
> -----Original Message-----
> From: yair gotdanker [mailto:yairgot@gmail.com]
> Sent: Sunday, June 29, 2008 4:46 AM
> To: core-user@hadoop.apache.org
> Subject: Hadoop - is it good for me and performance question
>
> Hello all,
>
>
>
> I am a newbie to Hadoop. The technology seems very interesting, but I am
> not
> sure it suits my needs.  I really appreciate your feedback.
>
>
>
> The problem:
>
> I have multiple log servers, each receiving 10-100 MB/minute. The received
> data is processed to produce aggregated data.
> The data processing should take a few minutes at most (10 min).
>
> In addition, I ran a performance benchmark on the wordcount example
> provided by the quickstart tutorial on my PC (pseudo-distributed, using
> the quickstart configuration files) and it took about 40 seconds!
> I must be missing something or doing something wrong here, since
> 40 seconds is way too long!
> The map/reduce functions should be very fast since there is almost no
> processing done, so I guess most of the time is spent in the Hadoop
> framework.
>
> I would appreciate any help in understanding this and how I can
> increase
> the performance.
> btw:
> Does anyone know a good behind-the-scenes tutorial that explains more
> about
> how the jobtracker/tasktracker communicate and so on?
>
