hadoop-general mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: DBInputFormat number of mappers
Date Thu, 15 Apr 2010 16:20:50 GMT
Hi Dan,

It's also worth pointing out that DBInputFormat's queries are written in
such a way as to make parallelism more likely to hurt than to help. Each
mapper submits a query to the database that does a full table scan followed
by an ORDER BY clause which has to run database-side. Only after this part
of the plan is complete does it select the appropriate output data slice
with LIMIT and OFFSET.
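
Concretely, each split boils down to something like "SELECT ... FROM mytable
ORDER BY id LIMIT <split length> OFFSET <split start>" (table and column
names made up here), so every mapper repeats the same scan-and-sort work
before its slice comes back, and adding mappers multiplies that cost rather
than dividing it.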

The moral of this story: If you're using CDH, consider using
DataDrivenDBInputFormat instead, which generates queries that actually run
in linear time :) (DataDrivenDBIF is also in Apache trunk, so those of you
running the straight Apache source will get it in the next release.)
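
If it helps, here's a rough sketch of the setup (MyRecord is a stand-in for
your own DBWritable implementation, and the connect string, table and column
names are placeholders; check the javadoc for the version you're running):

  // (in trunk these classes live under org.apache.hadoop.mapreduce.lib.db)
  Job job = new Job(new Configuration(), "db import");
  DBConfiguration.configureDB(job.getConfiguration(),
      "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/mydb", "user", "password");
  // Splits come from the min/max of the split-by column, so each mapper
  // runs a bounded range query instead of ORDER BY + LIMIT/OFFSET over
  // the whole table.
  DataDrivenDBInputFormat.setInput(job, MyRecord.class,
      "mytable", null /* conditions */, "id" /* split-by column */,
      "id", "payload");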

Also, if you're using Sqoop, consider just importing the data with
--as-sequencefile so that the records are dumped in a way that fully
preserves everything, including binary data. Then run a second MapReduce job
that runs your extraction code over that input.
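
Exact syntax depends on your Sqoop version, but the import itself is roughly
a one-liner along the lines of
"sqoop --connect jdbc:mysql://dbhost/mydb --table mytable --as-sequencefile"
(connect string and table name are placeholders), and your second job then
just points its input path at the resulting SequenceFiles.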

- Aaron


On Thu, Apr 15, 2010 at 6:09 AM, Dan Harvey <dan.harvey@mendeley.com> wrote:

> Unfortunately the tables I'm importing need application logic to extract
> the data in full, which means I can't use Sqoop for them, but I will be
> using it for all the simpler tables.
>
> I think I found the problem was with locking the tables in mysql, which
> caused the mappers to run one at a time, so I've rewritten the code so I'm
> not locking tables now and that's stopped it happening completely. It's
> odd that it didn't always go one at a time though, but I guess that might
> just have been down to how the mysql server was dealing with the queries
> at different times.
>
> Thanks,
>
> On 14 April 2010 17:28, Eric Sammer <esammer@cloudera.com> wrote:
>
> > If you're performing a simple import of an entire table, sqoop may
> > make your life easier. It gives you a reasonable command line client
> > for importing single tables or an entire database (provided there is a
> > JDBC driver available for it). Sqoop comes with Cloudera's
> > distribution for Hadoop or you can snag the source from
> > http://github.com/cloudera/sqoop
> >
> > If sqoop isn't appealing for some reason, you can at least take a look
> > at the code and see what Aaron did under the hood.
> >
> > On Tue, Apr 13, 2010 at 8:09 AM, Dan Harvey <dan.harvey@mendeley.com>
> > wrote:
> > > Right, after sending this e-mail out it started working straight away
> > > with no changes... So setting the number of mappers in the code using:
> > >
> > > job.getConfiguration().setInt("mapred.map.tasks", 4);
> > >
> > > allowed me to specify the number of splits/map tasks.
> > >
> > > Which leads me to the second problem I've been getting for a while.
> > > When I start a hadoop job using DBInputFormat as the input, if I use 5
> > > splits say, one will start straight away and the others will stay
> > > initializing until it is done, then carry on one at a time. This
> > > doesn't happen all the time though, and using the same code and
> > > database some will sometimes start in parallel!
> > >
> > > I've read this has happened to others before but no clear solution
> > > was found then.
> > >
> > > Has anyone else had this before or found a way to solve it?
> > >
> > > Thanks,
> > >
> > > On 13 April 2010 15:46, Dan Harvey <dan.harvey@mendeley.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm importing data from a mysql database using the DBInputFormat to
> > >> go over the rows in a table and put them into HBase with the mapper,
> > >> but I can't find a way to increase the number of maps it splits the
> > >> input into. I am running this on a cluster where we have 5 nodes and
> > >> each node has a maximum of 2 map tasks. So for example if I set the
> > >> number of rows to import to be 10,000,000 then there will only be 2
> > >> map tasks, using only two of the nodes.
> > >>
> > >> I've tried increasing the limit manually in the code with :
> > >>
> > >> job.getConfiguration().setInt("mapred.map.tasks", 4);
> > >>
> > >> increasing the number on the command line to set the same property,
> > >> and also increasing the number of map tasks per node.
> > >> But in all cases mapred.map.tasks is set to 2 in the job xml config
> > >> file.
> > >>
> > >> I've had a look at the code and DBInputFormat splits the total number
> > >> of rows over mapred.map.tasks, so I'm guessing the problem is just
> > >> getting that value to change.
> > >>
> > >> It would be great if anyone has any ideas about what's going on.
> > >>
> > >> Thanks,
> > >>
> > >> --
> > >> Dan Harvey | Datamining Engineer
> > >> www.mendeley.com/profiles/dan-harvey
> > >>
> > >> Mendeley Limited | London, UK | www.mendeley.com
> > >> Registered in England and Wales | Company Number 6419015
> > >>
> > >
> > >
> > >
> > > --
> > > Dan Harvey | Datamining Engineer
> > > www.mendeley.com/profiles/dan-harvey
> > >
> > > Mendeley Limited | London, UK | www.mendeley.com
> > > Registered in England and Wales | Company Number 6419015
> > >
> >
> >
> >
> > --
> > Eric Sammer
> > phone: +1-917-287-2675
> > twitter: esammer
> > data: www.cloudera.com
> >
>
>
>
> --
> Dan Harvey | Datamining Engineer
> www.mendeley.com/profiles/dan-harvey
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015
>
