hadoop-mapreduce-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: DBInputFormat - alternative select strategy?
Date Fri, 16 Oct 2009 20:32:48 GMT
It's actually using mysqldump, piped into HDFS. But, essentially yes.
- Aaron

On Thu, Oct 15, 2009 at 1:34 PM, tim robertson <timrobertson100@gmail.com> wrote:

> Thanks Aaron,
>
> It turned out that it was not easily possible to subclass the existing
> DBInputFormat due to the number of private variables (presumably for
> security since it deals with DB connections) and inner classes, so I
> just modified the existing one (hacky hacky).  I was going to create a
> patch for a slightly more "subclass friendly" version, but sounds like
> 0.21 has it sorted.
>
> Thanks for the sqoop link - nice to see it has optimised versions.
> Does the mysql --direct do a "select into outfile" piped in to HDFS?
>
> Cheers
> Tim
>
>
>
>
>
> On Thu, Oct 15, 2009 at 10:22 PM, Aaron Kimball <aaron@cloudera.com>
> wrote:
> > Tim,
> >
> > The DataDrivenDBInputFormat, which is in 0.21, does exactly what you suggest
> > re. the splits. This is also incorporated in Cloudera's Distribution for
> > Hadoop based on 0.20.1. I'll also point out that there's a command-line tool
> > that automates the whole process for you, called "Sqoop"; see
> > www.cloudera.com/hadoop-sqoop
> >
> > - Aaron
> >
> > On Mon, Oct 12, 2009 at 8:11 AM, tim robertson <timrobertson100@gmail.com>
> > wrote:
> >>
> >> Thanks Omer!
> >>
> >>
> >> On Mon, Oct 12, 2009 at 5:01 PM, Omer Trajman <omer@vertica.com> wrote:
> >> > For basic extract that's a sensible approach.  You can similarly use any
> >> > column with a low number of distinct values.  One caveat is that if you
> >> > don't run a count(*) query for each range you won't be able to generate
> >> > line number keys the way DBInputFormat does.
> >> >
> >> > -Omer
> >> >
> >> > -----Original Message-----
> >> > From: tim robertson [mailto:timrobertson100@gmail.com]
> >> > Sent: Monday, October 12, 2009 10:44 AM
> >> > To: mapreduce-user@hadoop.apache.org
> >> > Subject: DBInputFormat - alternative select strategy?
> >> >
> >> > Hi all,
> >> >
> >> > I've been dumping tables from mysql and loading them manually into
> >> > HDFS, but decided to look at the DBInputFormat to better automate
> >> > the process.
> >> >
> >> > I see it issuing the "select... from ... order by id limit..." which
> >> > takes ages (several minutes) on my large tables since I use myisam and
> >> > it hangs around on the "sorting result".
> >> >
> >> > Is there anything I should watch out for if I customise the
> >> > DBInputFormat to select the max(id) in the getCount(), and use that to
> >> > create ID ranges for the splits, and then issue the selects with:
> >> >
> >> >  select ... from ... where id between <lower> and <upper> order by id?
> >> >
> >> > It does mean that they won't be equal splits as there are holes in the
> >> > order, and some might be empty but it is a very fast select statement.
> >> >
> >> > Thanks for any pointers,
> >> >
> >> > Tim
> >> >
> >
> >
>
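
The range-based split strategy proposed in the original question (take max(id), cut the ID space into contiguous ranges, and issue one "where id between <lower> and <upper>" select per split) can be sketched roughly as below. This is only an illustration of the idea, not the code of DBInputFormat or DataDrivenDBInputFormat: the class IdRangeSplitter, its Range type, the split and toQuery helpers, and the table name my_table are all hypothetical, and the sketch assumes a numeric id column whose minimum and maximum have already been fetched (for example with a min(id)/max(id) query).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: cuts an ID space into ranges.
class IdRangeSplitter {

    /** Inclusive [lower, upper] ID bounds for one split. */
    static final class Range {
        final long lower;
        final long upper;

        Range(long lower, long upper) {
            this.lower = lower;
            this.upper = upper;
        }
    }

    /**
     * Divides [minId, maxId] into at most numSplits contiguous ranges of
     * equal width. Gaps in the ID sequence mean the ranges can hold unequal
     * numbers of rows, and some may be empty.
     */
    static List<Range> split(long minId, long maxId, int numSplits) {
        if (numSplits < 1 || maxId < minId) {
            throw new IllegalArgumentException("need numSplits >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long size = (total + numSplits - 1) / numSplits; // ceiling division
        List<Range> ranges = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            long lower = minId + i * size;
            if (lower > maxId) {
                break; // fewer IDs than requested splits
            }
            long upper = Math.min(lower + size - 1, maxId);
            ranges.add(new Range(lower, upper));
        }
        return ranges;
    }

    /** Builds the per-split select; table and column names are assumed trusted. */
    static String toQuery(String table, String idColumn, Range r) {
        return "SELECT * FROM " + table
                + " WHERE " + idColumn + " BETWEEN " + r.lower + " AND " + r.upper
                + " ORDER BY " + idColumn;
    }

    public static void main(String[] args) {
        // Assumed values: IDs 1..1000000 cut into 4 splits.
        for (Range r : split(1, 1_000_000, 4)) {
            System.out.println(toQuery("my_table", "id", r));
        }
    }
}
```

As noted in the thread, this gives uneven (possibly empty) splits when the ID sequence has holes, and without a count(*) per range the global line-number keys that DBInputFormat produces cannot be reconstructed.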
