hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Using Hadoop for near real-time processing of log data
Date Wed, 25 Feb 2009 21:18:19 GMT
>>Yeah, but what's the point of using Hadoop then? i.e. we lost all the
>>parallelism?

Some jobs do not need it. For example, I am working with the Hive
subproject. If I have a table that is smaller than my block size,
having a large number of mappers or reducers is counterproductive:
Hadoop will start up mappers that never get any data. Setting the job
tracker to 'local', or setting map tasks and reduce tasks to 1, makes
the job finish faster: about 10 seconds instead of 20.
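
As a rough sketch of what I mean, these are the kinds of settings
involved, typed at the Hive prompt before running the query. The
property names are the stock Hadoop ones; whether you want the first
line or the last two depends on your setup, and mapred.map.tasks is
only a hint, since the input splits decide the real number of mappers.

```sql
-- Option A: run the job in-process with the local job runner
set mapred.job.tracker=local;

-- Option B: stay on the cluster but ask for a single map and reduce task
set mapred.map.tasks=1;
set mapred.reduce.tasks=1;
```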

If you have a small data set and a system with 8 cores, the
MiniMRCluster can possibly be used as an embedded Hadoop. For some
jobs the most efficient parallelism might be 1.

WordCount of "1 2 3 4 5 6" on the MiniMRCluster test case takes less
than two seconds.
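
For a sense of how little work that input is, here is a plain Java
sketch of the same count done in a single thread. This is not the
MiniMRCluster test case and uses no Hadoop classes; the class name
TinyWordCount is made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

public class TinyWordCount {
    // Counts whitespace-separated tokens in one thread, roughly what a
    // single mapper and a single reducer would do between them.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields one empty token; skip it
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {1=1, 2=1, 3=1, 4=1, 5=1, 6=1}
        System.out.println(count("1 2 3 4 5 6"));
    }
}
```

With input this small, any time beyond the count itself is job setup,
which is why a parallelism of 1 can win.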

It may not be the common case, but it may be feasible to use Hadoop in
that manner.
