From: Dan Harvey <dan.harvey@mendeley.com>
To: general@hadoop.apache.org
Date: Thu, 15 Apr 2010 14:09:16 +0100
Subject: Re: DBInputFormat number of mappers

Unfortunately the tables I'm importing need application logic to extract the
data fully, which means I can't use Sqoop, but I will be using it for all the
simpler tables.

I think I've found the problem: locking the tables in MySQL caused the mappers
to run one at a time, so I've rewritten the code so that it no longer locks
tables, and that has stopped it happening completely. It's odd that it didn't
always go one at a time, though; I guess that might just have been to do with
how the MySQL server was dealing with the queries at different times.

Thanks,

On 14 April 2010 17:28, Eric Sammer wrote:

> If you're performing a simple import of an entire table, Sqoop may
> make your life easier. It gives you a reasonable command line client
> for importing single tables or an entire database (provided there is a
> JDBC driver available for it). Sqoop comes with Cloudera's
> distribution for Hadoop, or you can snag the source from
> http://github.com/cloudera/sqoop
>
> If Sqoop isn't appealing for some reason, you can at least take a look
> at the code and see what Aaron did under the hood.
>
> On Tue, Apr 13, 2010 at 8:09 AM, Dan Harvey wrote:
> > Right, after sending this e-mail out it started working straight away
> > with no changes... So setting the number of mappers in the code using:
> >
> > job.getConfiguration().setInt("mapred.map.tasks", 4);
> >
> > allowed me to specify the number of splits/map tasks.
> >
> > Which led me to the second problem I've been getting for a while.
> > When I start a Hadoop job using DBInputFormat as the input, if I use,
> > say, 5 splits, one will start straight away and the others will stay
> > initializing until it is done, then carry on one at a time. This doesn't
> > happen all the time though, and using the same code and database some
> > will sometimes start in parallel!
> >
> > I've read this has happened to others before, but no clear solution was
> > found then.
> >
> > Has anyone else had this before or found a way to solve it?
> >
> > Thanks,
> >
> > On 13 April 2010 15:46, Dan Harvey wrote:
> >
> >> Hi,
> >>
> >> I'm importing data from a MySQL database using the DBInputFormat to go
> >> over the rows in a table and put them into HBase with the mapper, but I
> >> can't find a way to increase the number of maps it splits the input
> >> into. I am running this on a cluster where we have 5 nodes and each node
> >> has a maximum of 2 map tasks. So, for example, if I set the number of
> >> rows to import to 10,000,000 then there will only be 2 map tasks, using
> >> only two of the nodes.
> >>
> >> I've tried increasing the limit manually in the code with:
> >>
> >> job.getConfiguration().setInt("mapred.map.tasks", 4);
> >>
> >> increasing the number on the command line to set the same property, and
> >> also increasing the number of map tasks per node. But in all cases
> >> mapred.map.tasks is set to 2 in the job XML config file.
> >>
> >> I've had a look at the code and DBInputFormat splits the total number of
> >> rows over mapred.map.tasks, so I'm guessing it's just a matter of
> >> getting that value to change.
> >>
> >> It would be great if anyone has any ideas about what's going on.
> >>
> >> Thanks,
> >>
> >> --
> >> Dan Harvey | Datamining Engineer
> >> www.mendeley.com/profiles/dan-harvey
> >>
> >> Mendeley Limited | London, UK | www.mendeley.com
> >> Registered in England and Wales | Company Number 6419015
> >
> > --
> > Dan Harvey | Datamining Engineer
> > www.mendeley.com/profiles/dan-harvey
> >
> > Mendeley Limited | London, UK | www.mendeley.com
> > Registered in England and Wales | Company Number 6419015
>
> --
> Eric Sammer
> phone: +1-917-287-2675
> twitter: esammer
> data: www.cloudera.com

--
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015
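
For reference, below is a minimal sketch of the kind of job setup discussed in
this thread, assuming Hadoop 0.20's org.apache.hadoop.mapreduce.lib.db API.
The DocumentImportJob and DocumentRecord classes, the "documents" table, its
columns, and the JDBC connection details are all placeholder names invented
for illustration; only the mapred.map.tasks setting and the use of
DBInputFormat come from the messages above.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DocumentImportJob {

      // Hypothetical record type describing one row of the source table.
      public static class DocumentRecord implements Writable, DBWritable {
        private long id;
        private String title;

        public void readFields(ResultSet rs) throws SQLException {
          id = rs.getLong("id");
          title = rs.getString("title");
        }

        public void write(PreparedStatement ps) throws SQLException {
          ps.setLong(1, id);
          ps.setString(2, title);
        }

        public void readFields(DataInput in) throws IOException {
          id = in.readLong();
          title = in.readUTF();
        }

        public void write(DataOutput out) throws IOException {
          out.writeLong(id);
          out.writeUTF(title);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // JDBC connection details for the MySQL source; all values are placeholders.
        DBConfiguration.configureDB(conf,
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/documents",
            "dbuser", "dbpassword");

        Job job = new Job(conf, "db-import");
        job.setJarByClass(DocumentImportJob.class);

        // DBInputFormat divides the ordered SELECT into mapred.map.tasks row
        // ranges, so this line is what actually controls the number of mappers,
        // as found in the thread above.
        job.getConfiguration().setInt("mapred.map.tasks", 4);

        // Registers DBInputFormat as the input format and records the query
        // details; table and column names are made up for this example.
        DBInputFormat.setInput(job, DocumentRecord.class,
            "documents",          // table name
            null,                 // optional WHERE conditions
            "id",                 // ORDER BY column used when splitting
            "id", "title");       // columns to select

        // A real job would set a Mapper here, e.g. one that writes each row
        // into HBase as described in the thread; with no Mapper set, the
        // identity mapper simply passes the rows through to the output path.
        job.setNumReduceTasks(0);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }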