hadoop-mapreduce-user mailing list archives

From Yang <teddyyyy...@gmail.com>
Subject controlling where a map task runs?
Date Tue, 17 Jan 2012 21:01:49 GMT
I understand that map tasks are normally run close to their input files.

But in my application, the input file is a text file with many lines of
query params. The mapper reads each line and uses its params to query a
local DB file (for example SQLite), so the query itself takes a lot of
time, while the input query params are very small. In this case the time
to fetch the input file is negligible. The DB file is already sitting on
all the boxes in the cluster, so there is no time spent copying it.

The problem is: when I have an empty cluster (100 nodes) and a job with
only 4 mappers, Hadoop schedules all 4 mappers on the same node, likely
close to where the input data is. But since the run time here is mostly
determined by CPU and disk seeks, I would like to spread them out as much
as possible.

Given that the input data is present on only one node, how is it possible
to spread out my mappers?
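[Editor's note: the behavior described above can be pictured with a toy model. In Hadoop, each InputSplit advertises preferred hosts via getLocations(), and the scheduler tries to honor those hints; a split that advertises no hosts is free to land on any idle node. The following is a minimal, self-contained simulation of that placement logic. The class names `SpreadDemo` and `Split` and the round-robin fallback are illustrative assumptions, not Hadoop's actual scheduler code.]

```java
import java.util.*;

// Toy simulation (NOT Hadoop code) of how locality hints drive placement:
// a split that reports preferred hosts is scheduled there; a split that
// reports none is placed round-robin across the cluster's nodes.
public class SpreadDemo {
    // Stand-in for InputSplit.getLocations(): hosts where the data lives.
    record Split(String name, String[] locations) {}

    static List<String> schedule(List<Split> splits, List<String> nodes) {
        List<String> placement = new ArrayList<>();
        int rr = 0; // round-robin cursor for splits with no locality hint
        for (Split s : splits) {
            if (s.locations().length > 0) {
                // Honor the locality hint: run where the data is.
                placement.add(s.locations()[0]);
            } else {
                // No hint: spread across the cluster.
                placement.add(nodes.get(rr++ % nodes.size()));
            }
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        // All four splits hint at node1 (where the input file lives):
        List<Split> hinted = new ArrayList<>();
        // The same splits with the hint stripped (getLocations() empty):
        List<Split> unhinted = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            hinted.add(new Split("s" + i, new String[]{"node1"}));
            unhinted.add(new Split("s" + i, new String[0]));
        }
        System.out.println(schedule(hinted, nodes));   // all land on node1
        System.out.println(schedule(unhinted, nodes)); // spread across nodes
    }
}
```

This suggests one direction for the question above: if a custom InputFormat's splits report no (or synthetic) locations, the scheduler has no reason to pile the tasks onto the node holding the input file.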

