hadoop-user mailing list archives

From Matthias Kricke <matthias.mk.kri...@gmail.com>
Subject how to enhance job start up speed?
Date Mon, 13 Aug 2012 11:51:55 GMT
Hello all,

I'm using CDH3u3.
If I process a single file whose input is set to non-splittable, Hadoop
starts one Mapper and no Reducer (that's fine for this test scenario). The
Mapper first goes through a configuration step in which some variables for
the worker inside the mapper are initialized.
The Mapper then gives me K,V pairs, which are the lines of the input file. I
process each V with the worker.

When I compare the run time under Hadoop with the run time of the same
process run sequentially, I get:

worker time --> same in both cases

case: mapper --> overhead of ~32% relative to the worker process (the same
for a bigger chunk size)
case: sequential --> overhead of ~15% relative to the worker process
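
To make these numbers concrete, here is a small sketch of the arithmetic. It assumes both overheads are fractions of the pure worker time; the class name, method names and the 100-second worker time are made up for illustration and are not measurements.

```java
public class OverheadMath {
    /** Total run time, given the pure worker time and an overhead expressed as a fraction of it. */
    static double totalTime(double workerSeconds, double overheadFraction) {
        return workerSeconds * (1.0 + overheadFraction);
    }

    /** How much slower run a is than run b, as a fraction of b's total time. */
    static double relativeSlowdown(double a, double b) {
        return a / b - 1.0;
    }

    public static void main(String[] args) {
        double worker = 100.0;                        // hypothetical worker time in seconds
        double mapper = totalTime(worker, 0.32);      // Hadoop case, ~32% overhead
        double sequential = totalTime(worker, 0.15);  // sequential case, ~15% overhead

        // The gap between the two cases, measured against the worker time.
        System.out.printf("gap: %.1f s (%.0f%% of worker time)%n",
                mapper - sequential, 100.0 * (mapper - sequential) / worker);
        // The same gap, measured against the sequential total.
        System.out.printf("Hadoop run is %.1f%% slower than the sequential run%n",
                100.0 * relativeSlowdown(mapper, sequential));
    }
}
```

Under that assumption, the gap is 17 percentage points of the worker time, which is slightly less than 15% of the total sequential run time.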

It shouldn't be that much slower: because the file is non-splittable, the
mapper will be executed on the node where HDFS stores the data, won't it?
Where did those 17% go, and how can I reduce them? Does Hadoop need all of
that time for reading or streaming the data out of HDFS?
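
One way to narrow down where the time goes might be to time each phase separately and compare their shares. The following is only a sketch in plain Java: `PhaseTimer` and the phase names are hypothetical, and it assumes the configuration step and the per-line work can be wrapped in calls like these (for example inside the mapper's setup and map methods).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PhaseTimer {
    // Accumulated nanoseconds per phase, kept in insertion order for readable output.
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Adds an already measured duration to a phase. */
    public void add(String phase, long nanos) {
        nanosByPhase.merge(phase, nanos, Long::sum);
    }

    /** Runs body and charges its wall-clock time to the given phase. */
    public void time(String phase, Runnable body) {
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            add(phase, System.nanoTime() - start);
        }
    }

    /** Sum of all recorded phases in nanoseconds. */
    public long total() {
        long sum = 0L;
        for (long nanos : nanosByPhase.values()) {
            sum += nanos;
        }
        return sum;
    }

    /** Fraction of the recorded time spent in one phase; 0.0 if nothing was recorded for it. */
    public double share(String phase) {
        long total = total();
        Long nanos = nanosByPhase.get(phase);
        return (total == 0L || nanos == null) ? 0.0 : (double) nanos / total;
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        timer.time("setup", () -> { /* hypothetical: initialize the worker */ });
        for (int line = 0; line < 1000; line++) {
            timer.time("work", () -> Math.sqrt(42.0)); // stand-in for processing one V
        }
        System.out.printf("setup: %.1f%%, work: %.1f%%%n",
                100.0 * timer.share("setup"), 100.0 * timer.share("work"));
    }
}
```

Comparing the summed phase times against the total job time reported by Hadoop would then show how much of the run falls outside the mapper code itself, for example in job start-up or in reading the input.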

I would appreciate your help,

