From: Dan Harvey <dan.harvey@mendeley.com>
To: general@hadoop.apache.org
Date: Thu, 15 Apr 2010 14:09:16 +0100
Subject: Re: DBInputFormat number of mappers

Unfortunately the tables I'm importing need application logic to extract the
data fully, which means I can't use Sqoop, but I will be using it for all the
simpler tables.

I think I've found the problem: locking the tables in MySQL caused the mappers
to run one at a time, so I've rewritten the code so that it no longer locks
tables, and that has stopped it happening completely. It's odd that it didn't
always go one at a time, though; I guess that might just have been to do with
how the MySQL server was dealing with the queries at different times.

Thanks,

On 14 April 2010 17:28, Eric Sammer wrote:

> If you're performing a simple import of an entire table, Sqoop may
> make your life easier. It gives you a reasonable command line client
> for importing single tables or an entire database (provided there is a
> JDBC driver available for it). Sqoop comes with Cloudera's
> distribution for Hadoop, or you can snag the source from
> http://github.com/cloudera/sqoop
>
> If Sqoop isn't appealing for some reason, you can at least take a look
> at the code and see what Aaron did under the hood.
>
> On Tue, Apr 13, 2010 at 8:09 AM, Dan Harvey wrote:
> > Right, after sending this e-mail out it started working straight away
> > with no changes... So setting the number of mappers in the code using:
> >
> > job.getConfiguration().setInt("mapred.map.tasks", 4);
> >
> > allowed me to specify the number of splits/map tasks.
> >
> > Which led me to the second problem I've been getting for a while.
> > When I start a Hadoop job using DBInputFormat as the input, if I use,
> > say, 5 splits, one will start straight away and the others will stay
> > initializing until it is done, then carry on one at a time. This doesn't
> > happen all the time though, and using the same code and database some
> > will sometimes start in parallel!
> >
> > I've read this has happened to others before, but no clear solution was
> > found then.
> >
> > Has anyone else had this before or found a way to solve it?
> >
> > Thanks,
> >
> > On 13 April 2010 15:46, Dan Harvey wrote:
> >
> >> Hi,
> >>
> >> I'm importing data from a MySQL database using the DBInputFormat to go
> >> over the rows in a table and put them into HBase with the mapper, but I
> >> can't find a way to increase the number of maps it splits the input
> >> into. I am running this on a cluster where we have 5 nodes and each node
> >> has a maximum of 2 map tasks. So, for example, if I set the number of
> >> rows to import to 10,000,000 then there will only be 2 map tasks, using
> >> only two of the nodes.
> >>
> >> I've tried increasing the limit manually in the code with:
> >>
> >> job.getConfiguration().setInt("mapred.map.tasks", 4);
> >>
> >> increasing the number on the command line to set the same property, and
> >> also increasing the number of map tasks per node. But in all cases
> >> mapred.map.tasks is set to 2 in the job XML config file.
> >>
> >> I've had a look at the code and DBInputFormat splits the total number of
> >> rows over mapred.map.tasks, so I'm guessing it's just a matter of
> >> getting that value to change.
> >>
> >> It would be great if anyone has any ideas about what's going on.
> >>
> >> Thanks,
> >>
> >> --
> >> Dan Harvey | Datamining Engineer
> >> www.mendeley.com/profiles/dan-harvey
> >>
> >> Mendeley Limited | London, UK | www.mendeley.com
> >> Registered in England and Wales | Company Number 6419015
> >
> > --
> > Dan Harvey | Datamining Engineer
> > www.mendeley.com/profiles/dan-harvey
> >
> > Mendeley Limited | London, UK | www.mendeley.com
> > Registered in England and Wales | Company Number 6419015
>
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com

--
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015
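
For reference, below is a minimal sketch of the kind of job setup discussed in
this thread, assuming Hadoop 0.20's org.apache.hadoop.mapreduce.lib.db API.
The DocumentImportJob and DocumentRecord classes, the "documents" table, its
columns, and the JDBC connection details are all placeholder names invented
for illustration; only the mapred.map.tasks setting and the use of
DBInputFormat come from the messages above.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DocumentImportJob {

      // Hypothetical record type describing one row of the source table.
      public static class DocumentRecord implements Writable, DBWritable {
        private long id;
        private String title;

        public void readFields(ResultSet rs) throws SQLException {
          id = rs.getLong("id");
          title = rs.getString("title");
        }

        public void write(PreparedStatement ps) throws SQLException {
          ps.setLong(1, id);
          ps.setString(2, title);
        }

        public void readFields(DataInput in) throws IOException {
          id = in.readLong();
          title = in.readUTF();
        }

        public void write(DataOutput out) throws IOException {
          out.writeLong(id);
          out.writeUTF(title);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // JDBC connection details for the MySQL source; all values are placeholders.
        DBConfiguration.configureDB(conf,
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/documents",
            "dbuser", "dbpassword");

        Job job = new Job(conf, "db-import");
        job.setJarByClass(DocumentImportJob.class);

        // DBInputFormat divides the ordered SELECT into mapred.map.tasks row
        // ranges, so this line is what actually controls the number of mappers,
        // as found in the thread above.
        job.getConfiguration().setInt("mapred.map.tasks", 4);

        // Registers DBInputFormat as the input format and records the query
        // details; table and column names are made up for this example.
        DBInputFormat.setInput(job, DocumentRecord.class,
            "documents",          // table name
            null,                 // optional WHERE conditions
            "id",                 // ORDER BY column used when splitting
            "id", "title");       // columns to select

        // A real job would set a Mapper here, e.g. one that writes each row
        // into HBase as described in the thread; with no Mapper set, the
        // identity mapper simply passes the rows through to the output path.
        job.setNumReduceTasks(0);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }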