MapReduce on Hadoop is for processing very large amounts of data; otherwise the overhead of the framework (job scheduling, failover, etc.) does not justify its use. If you are processing 10-100 MB/min, that is roughly 14-140 GB a day, which probably does justify it, I would say.

You can't get a performance estimate from a pseudo-distributed cluster on one machine with small amounts of data; it is just not what Hadoop is designed for.

I have recently gone through what you are doing, and then went to EC2 to do my first real test at the weekend. Have you considered a test run on EC2 with a 140 GB file? It takes about a day from starting to getting running unless you are into EC2 already, as there is a fair amount to read and set up, and it will cost you around US$5 in total. I blogged my experience here, which should help you avoid a couple of pitfalls:

http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html

I subsequently found I had run only 1 reducer, and it was the reducer that took 50% of the time. I should have run more like 10 reducers for the job I was doing...

Cheers,
Tim

On Tue, Jul 1, 2008 at 10:14 AM, Yair Gottdenker wrote:
> Thanks for your reply Haijun,
>
> Do you know what makes Hadoop run so slowly? I have been trying to
> figure it out myself, but I can't imagine anything so complicated that
> it justifies Hadoop's performance and latency.
>
> -----Original Message-----
> From: Haijun Cao [mailto:haijun@kindsight.net]
> Sent: Monday, June 30, 2008 9:33 PM
> To: core-user@hadoop.apache.org
> Subject: RE: Hadoop - is it good for me and performance question
>
> Not sure if this will answer your question, but here is a similar
> thread regarding Hadoop performance:
>
> http://www.mail-archive.com/core-user@hadoop.apache.org/msg02878.html
>
> Hadoop is good for log processing if you have a lot of logs to process
> and you don't need the result in real time (e.g. you can accumulate
> one day's logs and process them in one batch, latency == 1 day).
> In other words, it shines at batch processing (long latency) of large
> data sets. It is good at scalability (scaling out), not at increasing
> single-core/machine performance. If your data fits in one process,
> then using a distributed framework will probably slow it down.
>
> Haijun
>
> -----Original Message-----
> From: yair gotdanker [mailto:yairgot@gmail.com]
> Sent: Sunday, June 29, 2008 4:46 AM
> To: core-user@hadoop.apache.org
> Subject: Hadoop - is it good for me and performance question
>
> Hello all,
>
> I am a newbie to Hadoop. The technology seems very interesting, but I
> am not sure it suits my needs. I would really appreciate your feedback.
>
> The problem:
>
> I have multiple log servers, each receiving 10-100 MB/minute. The
> received data is processed to produce aggregated data.
> The data processing should take a few minutes at most (10 min).
>
> In addition, I ran the performance benchmark of the wordcount example
> from the quickstart tutorial on my PC (pseudo-distributed, using the
> quickstart configuration files), and it took about 40 seconds!
> I must be doing something wrong here, since 40 seconds is way too long.
> The map/reduce functions should be very fast, since there is almost no
> processing done. So I guess most of the time is spent in the Hadoop
> framework.
>
> I would appreciate any help in understanding this and how I can
> increase the performance.
> By the way:
> Does anyone know of a good behind-the-scenes tutorial that explains
> more about how the jobtracker/tasktracker communicate and so on?
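[Editor's note on Tim's reducer remark above: in the Hadoop of this era the reducer count could be set per job with JobConf.setNumReduceTasks(int), or as a site-wide default via the mapred.reduce.tasks property. A minimal configuration sketch, using Tim's ballpark figure of 10 (the right value depends on your cluster size):]

```xml
<!-- hadoop-site.xml fragment: default number of reduce tasks per job.
     Running more than one reducer keeps the reduce phase from being
     serialized through a single task, as happened in Tim's run. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>10</value>
</property>
```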