hadoop-general mailing list archives

From Sonal Goyal <sonalgoy...@gmail.com>
Subject Re: DBInputFormat number of mappers
Date Mon, 19 Apr 2010 02:37:56 GMT
Hi Dan,

You can also check http://code.google.com/p/hiho/ for importing query-based
data using bound range queries.
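
If you are on a recent enough Hadoop, the stock DataDrivenDBInputFormat works
along the same lines: you give it a bounding query and it hands each mapper
its own key range. A rough, untested sketch (the DocumentRecord DBWritable and
the table, column and connection details below are made up for illustration):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

// Assumes 'job' is your Job and DocumentRecord implements DBWritable.
DBConfiguration.configureDB(job.getConfiguration(),
    "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/mydb", "user", "pass");
// $CONDITIONS is replaced per split with a range predicate on the split key.
DataDrivenDBInputFormat.setInput(job, DocumentRecord.class,
    "SELECT id, title FROM documents WHERE $CONDITIONS",
    "SELECT MIN(id), MAX(id) FROM documents");
job.setInputFormatClass(DataDrivenDBInputFormat.class);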

Thanks and Regards,
Sonal
www.meghsoft.com


On Thu, Apr 15, 2010 at 6:39 PM, Dan Harvey <dan.harvey@mendeley.com> wrote:

> Unfortunately the tables I'm importing need application logic to extract
> the data in full, which means I can't use sqoop, but I will be using it for
> all the simpler tables.
>
> I think I found the problem: locking the tables in mysql was causing the
> mappers to run one at a time, so I've rewritten the code so that it no
> longer locks tables, and that has stopped it happening completely. It's odd
> that it didn't always go one at a time, though; I guess that might just have
> been down to how the mysql server was dealing with the queries at different
> times.
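>
> (Roughly what an unlocked, per-split read looks like -- plain JDBC, with the
> table, column and connection details as placeholders, and splitStart /
> splitLength standing in for the split bounds:)
>
> import java.sql.*;
>
> // Hypothetical helper: read one split's rows with an ordinary SELECT,
> // no LOCK TABLES involved.
> static void readSplit(long splitStart, long splitLength) throws SQLException {
>   Connection conn = DriverManager.getConnection(
>       "jdbc:mysql://dbhost/mydb", "user", "pass");
>   Statement st = conn.createStatement(
>       ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
>   ResultSet rs = st.executeQuery(
>       "SELECT id, title, authors FROM documents ORDER BY id"
>       + " LIMIT " + splitLength + " OFFSET " + splitStart);
>   while (rs.next()) {
>     // build the record and hand it on to the mapper / HBase put
>   }
>   rs.close();
>   st.close();
>   conn.close();
> }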
>
> Thanks,
>
> On 14 April 2010 17:28, Eric Sammer <esammer@cloudera.com> wrote:
>
> > If you're performing a simple import of an entire table, sqoop may
> > make your life easier. It gives you a reasonable command line client
> > for importing single tables or an entire database (provided there is a
> > JDBC driver available for it). Sqoop comes with Cloudera's
> > distribution for Hadoop or you can snag the source from
> > http://github.com/cloudera/sqoop
> >
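> > To give a flavour of it, a single-table import looks roughly like this
> > (option names can differ between versions, so check sqoop --help / the
> > docs for the release you have):
> >
> >   sqoop import \
> >     --connect jdbc:mysql://dbhost/mydb \
> >     --username myuser --password mypass \
> >     --table documents \
> >     --num-mappers 4
> >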
> > If sqoop isn't appealing for some reason, you can at least take a look
> > at the code and see what Aaron did under the hood.
> >
> > On Tue, Apr 13, 2010 at 8:09 AM, Dan Harvey <dan.harvey@mendeley.com>
> > wrote:
> > > Right, after sending this e-mail out it started working straight away
> > > with no changes... So setting the number of mappers in the code using:
> > >
> > > job.getConfiguration().setInt("mapred.map.tasks", 4);
> > >
> > > allowed me to specify the number of splits/map tasks.
> > >
> > > Which leads me to the second problem I've been getting for a while. When
> > > I start a hadoop job using DBInputFormat as the input, if I use 5 splits,
> > > say, one will start straight away and the others will stay initializing
> > > until it is done, then carry on one at a time. This doesn't happen all
> > > the time though; using the same code and database, some will sometimes
> > > start in parallel!
> > >
> > > I've read this has happened to others before, but no clear solution was
> > > found then.
> > >
> > > Has anyone else had this before or found a way to solve it?
> > >
> > > Thanks,
> > >
> > > On 13 April 2010 15:46, Dan Harvey <dan.harvey@mendeley.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm importing data from a mysql database using the DBInputFormat to go
> > >> over the rows in a table and put them into HBase with the mapper, but I
> > >> can't find a way to increase the number of maps it splits the input
> > >> into. I am running this on a cluster where we have 5 nodes and each
> > >> node has a maximum of 2 map tasks. So, for example, if I set the number
> > >> of rows to import to 10,000,000 there will only be 2 map tasks, using
> > >> only two of the nodes.
> > >>
> > >> I've tried increasing the number of map tasks manually in the code with:
> > >>
> > >> job.getConfiguration().setInt("mapred.map.tasks", 4);
> > >>
> > >> increasing the number on the command line to set the same property, and
> > >> also increasing the number of map tasks per node.
> > >> But in all cases mapred.map.tasks is set to 2 in the job xml config
> > >> file.
> > >>
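> > >> (Side note: for a -D override on the command line to reach the job at
> > >> all, the driver has to go through ToolRunner/GenericOptionsParser -- a
> > >> minimal skeleton of that, with placeholder class and job names:)
> > >>
> > >> import org.apache.hadoop.conf.Configuration;
> > >> import org.apache.hadoop.conf.Configured;
> > >> import org.apache.hadoop.mapreduce.Job;
> > >> import org.apache.hadoop.util.Tool;
> > >> import org.apache.hadoop.util.ToolRunner;
> > >>
> > >> public class ImportJob extends Configured implements Tool {
> > >>   public int run(String[] args) throws Exception {
> > >>     // getConf() already carries any -D overrides parsed by ToolRunner.
> > >>     Job job = new Job(getConf(), "db-import");
> > >>     // ... configure DBInputFormat, mapper, output, etc. ...
> > >>     return job.waitForCompletion(true) ? 0 : 1;
> > >>   }
> > >>
> > >>   public static void main(String[] args) throws Exception {
> > >>     System.exit(ToolRunner.run(new Configuration(), new ImportJob(), args));
> > >>   }
> > >> }
> > >>
> > >> // run as: hadoop jar import.jar ImportJob -D mapred.map.tasks=4 ...
> > >>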
> > >> I've had a look at the code and DBInputFormat splits the total number
> > >> of rows over mapred.map.tasks, so I'm guessing it's just a matter of
> > >> getting that value to change.
> > >>
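> > >> As a sanity check on my reading of it, the arithmetic is basically "row
> > >> count divided evenly over mapred.map.tasks chunks" -- a stand-alone
> > >> paraphrase of the shape of it (not the actual Hadoop source):
> > >>
> > >> import java.util.ArrayList;
> > >> import java.util.List;
> > >>
> > >> // One {start, end} row range per chunk; the last chunk takes the
> > >> // remainder, so the number of chunks is the number of map tasks.
> > >> static List<long[]> computeSplits(long rowCount, int chunks) {
> > >>   List<long[]> splits = new ArrayList<long[]>();
> > >>   long chunkSize = rowCount / chunks;
> > >>   for (int i = 0; i < chunks; i++) {
> > >>     long start = i * chunkSize;
> > >>     long end = (i == chunks - 1) ? rowCount : start + chunkSize;
> > >>     splits.add(new long[] { start, end });
> > >>   }
> > >>   return splits;
> > >> }
> > >> // e.g. computeSplits(10000000L, 2) gives just two ranges -> 2 mappers.
> > >>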
> > >> It would be great if anyone has any ideas about what's going on.
> > >>
> > >> Thanks,
> > >>
> > >> --
> > >> Dan Harvey | Datamining Engineer
> > >> www.mendeley.com/profiles/dan-harvey
> > >>
> > >> Mendeley Limited | London, UK | www.mendeley.com
> > >> Registered in England and Wales | Company Number 6419015
> > >>
> > >
> > >
> > >
> > > --
> > > Dan Harvey | Datamining Engineer
> > > www.mendeley.com/profiles/dan-harvey
> > >
> > > Mendeley Limited | London, UK | www.mendeley.com
> > > Registered in England and Wales | Company Number 6419015
> > >
> >
> >
> >
> > --
> > Eric Sammer
> > phone: +1-917-287-2675
> > twitter: esammer
> > data: www.cloudera.com
> >
>
>
>
> --
> Dan Harvey | Datamining Engineer
> www.mendeley.com/profiles/dan-harvey
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015
>
