hadoop-mapreduce-user mailing list archives

From Rohit Kelkar <rohitkel...@gmail.com>
Subject Re: controlling where a map task runs?
Date Wed, 18 Jan 2012 07:34:44 GMT
You could try NLineInputFormat. With N = 1, the number of mappers equals the
number of lines in your file. If the number of mappers required exceeds the
maximum that can run on a single node, I think the remaining mappers would be
scheduled on the other nodes in the cluster without obeying data locality. Of
course N = 1 is an extreme case; you could tune the value of N based on the
number of lines in your input file.
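To illustrate the splitting behavior (this is a plain-Python sketch of the
idea, not Hadoop code; the function name `n_line_splits` is made up for this
example), NLineInputFormat effectively groups the input into splits of N lines
each, and each split becomes one map task:

```python
def n_line_splits(lines, n):
    """Group input lines into splits of n lines each, mimicking the
    splitting done by Hadoop's NLineInputFormat: one split per map task."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]

queries = ["q1", "q2", "q3", "q4"]

# With N=1, every line becomes its own split, so 4 mappers are launched,
# which the scheduler can place on up to 4 different nodes.
print(len(n_line_splits(queries, 1)))  # 4

# With N=2, two lines go into each split, so only 2 mappers are launched.
print(len(n_line_splits(queries, 2)))  # 2
```

So with 4 query lines and N = 1 you would get 4 separate map tasks, which is
what lets the scheduler spread them across nodes.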

- Rohit Kelkar

On Wed, Jan 18, 2012 at 2:31 AM, Yang <teddyyyy123@gmail.com> wrote:
> I understand that normally map tasks are run close to the input files.
> But in my application, the input file is a txt file with many lines of query
> params, and the mapper reads each line and uses the params in the line to
> query a local db file (for example sqlite3), so the query itself takes a lot
> of time, while the input query params are very small. So in this case the
> time to fetch the input file is negligible. The db file is already sitting
> on all the boxes in the cluster, so there is no time spent copying the db.
> The problem is, when I have an empty cluster (100 nodes) and a task with
> only 4 mappers, hadoop schedules all 4 mappers on the same node, likely
> close to where the data is. But since the run time here is mostly
> determined by CPU and disk seeking, I would like to spread them out as much
> as possible.
> Given that the data is present on only 1 node, how is it possible to spread
> out my mappers?
> Thanks
> Yang
