hadoop-common-user mailing list archives

From Yair Gottdenker <y...@cotendo.com>
Subject RE: Hadoop - is it good for me and performance question
Date Tue, 01 Jul 2008 08:14:55 GMT
Thanks for your reply Haijun,

Do you know what makes Hadoop run so slowly? I have been trying to figure
it out myself, but I can't imagine anything so complicated that it would
justify Hadoop's performance and latency.

-----Original Message-----
From: Haijun Cao [mailto:haijun@kindsight.net] 
Sent: Monday, June 30, 2008 9:33 PM
To: core-user@hadoop.apache.org
Subject: RE: Hadoop - is it good for me and performance question

Not sure if this will answer your question, but a similar thread
regarding hadoop performance:


Hadoop is good for log processing if you have a lot of logs to process
and you don't need the results in real time (e.g. you can accumulate one
day's logs and process them in one batch, latency == 1 day). In other
words, it shines at batch processing of large data sets, where long
latency is acceptable. It is good at scalability (scale out), not at
increasing single core/machine performance. If your data fits in one
process, then using a distributed framework will probably slow it down.
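
As a rough illustration of that last point (not part of the original
message), a single-process word count over a small input can be sketched
as below. The sample lines and the function name `word_count` are made up
for the example. It only shows what a non-distributed baseline might look
like, not how Hadoop's own example is implemented.

```python
import time
from collections import Counter


def word_count(lines):
    """Count whitespace-separated words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


if __name__ == "__main__":
    # Hypothetical sample input; substitute real log lines to compare.
    sample = ["hadoop is good for batch", "batch jobs need large data"] * 1000
    start = time.perf_counter()
    counts = word_count(sample)
    elapsed = time.perf_counter() - start
    print(counts.most_common(3), f"{elapsed:.4f}s")
```

Timing the same input both ways is one way to check whether a data set is
small enough that a single process is the faster option.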


-----Original Message-----
From: yair gotdanker [mailto:yairgot@gmail.com] 
Sent: Sunday, June 29, 2008 4:46 AM
To: core-user@hadoop.apache.org
Subject: Hadoop - is it good for me and performance question

Hello all,

I am a newbie to Hadoop. The technology seems very interesting, but I am
not sure it suits my needs. I would really appreciate your feedback.

The problem:

I have multiple log servers, each receiving 10-100 MB/minute. The received
data is processed to produce aggregated data.
Processing the data should take a few minutes at most (10 min).
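
For a sense of scale, the volume arriving during one processing window
can be estimated with simple arithmetic. This is a sketch added for
illustration: it reads the rate above as megabytes per minute, and the
server count of 5 is a made-up figure, since the message only says
"multiple".

```python
def window_volume_mb(rate_mb_per_min, window_min, servers):
    """Total data (MB) received by all servers during one processing window."""
    return rate_mb_per_min * window_min * servers


if __name__ == "__main__":
    servers = 5      # hypothetical; the message only says "multiple" log servers
    window_min = 10  # upper bound on processing time given in the message
    for rate in (10, 100):  # low and high end of the stated per-server rate
        total = window_volume_mb(rate, window_min, servers)
        print(f"{rate} MB/min per server -> {total} MB per {window_min}-minute window")
```

Under these assumed numbers a window holds somewhere between 500 MB and
about 5 GB, so the answer depends heavily on the real server count.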

In addition, I ran a performance benchmark of the wordcount example
provided by the quickstart tutorial on my PC (pseudo-distributed, using
the quickstart configuration files), and it took about 40 seconds!
I must be missing something or doing something wrong here;
40 seconds is way too long!
The map/reduce functions should be very fast since there is almost no
work done. So I guess most of the time is spent in the Hadoop framework.
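
One way to reason about that guess is to split a job's wall-clock time
into the time spent in the map/reduce functions and everything else. The
sketch below is illustrative only: the 40 seconds comes from the run
described above, while the 1 second of actual computation is an assumed
figure, not a measurement.

```python
def overhead_fraction(total_seconds, compute_seconds):
    """Share of wall-clock time not spent in the user's own computation."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= compute_seconds <= total_seconds:
        raise ValueError("compute_seconds must be between 0 and total_seconds")
    return (total_seconds - compute_seconds) / total_seconds


if __name__ == "__main__":
    total = 40.0   # wall-clock time reported in the message
    compute = 1.0  # assumed time inside the map/reduce functions (hypothetical)
    print(f"framework share: {overhead_fraction(total, compute):.1%}")
```

If the share stays this high on a much larger input, the cost is not just
fixed startup time. If it drops, the overhead is being amortized.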

I would appreciate any help in understanding this and how I can improve
the performance.
Does anyone know a good behind-the-scenes tutorial that explains more
about how the jobtracker and tasktracker communicate, and so on?
