hadoop-general mailing list archives

From Eric Sammer <esam...@cloudera.com>
Subject Re: DBInputFormat number of mappers
Date Wed, 14 Apr 2010 16:28:35 GMT
If you're performing a simple import of an entire table, sqoop may
make your life easier. It gives you a reasonable command line client
for importing single tables or an entire database (provided there is a
JDBC driver available for it). Sqoop ships with Cloudera's
Distribution for Hadoop, or you can snag the source from
http://github.com/cloudera/sqoop

If sqoop isn't appealing for some reason, you can at least take a look
at the code and see what Aaron did under the hood.
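As a rough illustration of the sqoop route, a single-table MySQL import might look like the following. The exact flags vary by sqoop version, and the host, database, table, and credentials here are placeholders, not from this thread:

```shell
# Hypothetical invocation -- connection details and table name are
# placeholders; check `sqoop help import` for your version's flags.
sqoop import \
  --connect jdbc:mysql://db.example.com/mydb \
  --username dbuser -P \
  --table documents \
  --num-mappers 4    # number of parallel map tasks for the import
```

Note that sqoop exposes the degree of parallelism directly via `--num-mappers`, which sidesteps the `mapred.map.tasks` guessing game discussed below.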

On Tue, Apr 13, 2010 at 8:09 AM, Dan Harvey <dan.harvey@mendeley.com> wrote:
> Right, after sending this e-mail out it started working straight away with
> no changes... So setting the number of mappers in the code using:
>
> job.getConfiguration().setInt("mapred.map.tasks", 4);
>
> allowed me to specify the number of splits/map tasks.
>
> Which led me to the second problem I've been getting for a while. When I
> start a hadoop job using DBInputFormat as the input, if I use, say, 5 splits,
> one will start straight away and the others will stay in the initializing
> state until it is done, then carry on one at a time. This doesn't happen all
> the time though; with the same code and database, the tasks will sometimes
> start in parallel!
>
> I've read this has happened to others before, but no clear solution was
> found then.
>
> Has anyone else had this before or found a way to solve it?
>
> Thanks,
>
> On 13 April 2010 15:46, Dan Harvey <dan.harvey@mendeley.com> wrote:
>
>> Hi,
>>
>> I'm importing data from a MySQL database using DBInputFormat to go over
>> the rows in a table and put them into HBase with the mapper, but I can't find
>> a way to increase the number of map tasks it splits the input into. I am
>> running this on a cluster of 5 nodes, each with a maximum of 2 map tasks.
>> So, for example, if I set the number of rows to import to 10,000,000,
>> there will be only 2 map tasks, using only two of the nodes.
>>
>> I've tried increasing the limit manually in the code with:
>>
>> job.getConfiguration().setInt("mapred.map.tasks", 4);
>>
>> setting the same property on the command line, and also increasing the
>> number of map tasks per node.
>> But in all cases mapred.map.tasks is set to 2 in the job's XML config file.
>>
>> I've had a look at the code, and DBInputFormat splits the total number of
>> rows over mapred.map.tasks, so I'm guessing it's just a matter of getting
>> that value to change.
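The arithmetic described above can be sketched as follows. This is a simplified illustration of how DBInputFormat-style splitting behaves, not the actual Hadoop source: the row count is divided evenly over the `mapred.map.tasks` hint, so the split count (and hence the parallelism) is capped by that value.

```java
// Sketch of DBInputFormat-style split computation (illustrative,
// not the real Hadoop implementation).
public class DbSplitSketch {
    // Returns {startRow, length} pairs, one per requested map task.
    static long[][] computeSplits(long rowCount, int numMapTasks) {
        long chunkSize = rowCount / numMapTasks;
        long[][] splits = new long[numMapTasks][2];
        for (int i = 0; i < numMapTasks; i++) {
            long start = i * chunkSize;
            // The last split absorbs the remainder so no rows are dropped.
            long length = (i == numMapTasks - 1) ? rowCount - start : chunkSize;
            splits[i][0] = start;
            splits[i][1] = length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // With mapred.map.tasks stuck at 2, 10,000,000 rows become two
        // huge splits, so only two of the five nodes get any work.
        System.out.println(computeSplits(10_000_000L, 2).length); // prints 2
        // Raising the hint to 4 yields four splits instead.
        System.out.println(computeSplits(10_000_000L, 4).length); // prints 4
    }
}
```

This is why bumping `mapred.map.tasks` is the only lever here: the split count comes straight from that property rather than from any data-size heuristic.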
>>
>> It would be great if anyone has any ideas about what's going on.
>>
>> Thanks,
>>
>> --
>> Dan Harvey | Datamining Engineer
>> www.mendeley.com/profiles/dan-harvey
>>
>> Mendeley Limited | London, UK | www.mendeley.com
>> Registered in England and Wales | Company Number 6419015
>>
>
>
>
> --
> Dan Harvey | Datamining Engineer
> www.mendeley.com/profiles/dan-harvey
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015
>



-- 
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
