hadoop-mapreduce-user mailing list archives

From tim robertson <timrobertson...@gmail.com>
Subject Re: DBInputFormat - alternative select strategy?
Date Thu, 15 Oct 2009 20:34:02 GMT
Thanks Aaron,

It turned out that it was not easily possible to subclass the existing
DBInputFormat due to the number of private variables (presumably for
security since it deals with DB connections) and inner classes, so I
just modified the existing one (hacky hacky).  I was going to create a
patch for a slightly more "subclass friendly" version, but it sounds like
0.21 has it sorted.

Thanks for the sqoop link - nice to see it has optimised versions.
Does the mysql --direct do a "select into outfile" piped in to HDFS?

Cheers
Tim

On Thu, Oct 15, 2009 at 10:22 PM, Aaron Kimball <aaron@cloudera.com> wrote:
> Tim,
>
> The DataDrivenDBInputFormat, which is in 0.21, does exactly what you suggest
> re. the splits. This is also incorporated in Cloudera's Distribution for
> Hadoop based on 0.20.1. I'll also point out that there's a command-line tool
> that automates the whole process for you, called "Sqoop"; see
> www.cloudera.com/hadoop-sqoop
>
> - Aaron
>
> On Mon, Oct 12, 2009 at 8:11 AM, tim robertson <timrobertson100@gmail.com>
> wrote:
>>
>> Thanks Omer!
>>
>>
>> On Mon, Oct 12, 2009 at 5:01 PM, Omer Trajman <omer@vertica.com> wrote:
>> > For basic extract that's a sensible approach.  You can similarly use any
>> > column with a low number of distinct values.  One caveat is that if you
>> > don't run a count(*) query for each range you won't be able to generate
>> > line number keys the way dbinputformat does.
>> >
>> > -Omer
>> >
>> > -----Original Message-----
>> > From: tim robertson [mailto:timrobertson100@gmail.com]
>> > Sent: Monday, October 12, 2009 10:44 AM
>> > To: mapreduce-user@hadoop.apache.org
>> > Subject: DBInputFormat - alternative select strategy?
>> >
>> > Hi all,
>> >
>> > I've been dumping tables from mysql and loading them manually into
>> > HDFS, but decided to look at the DBInputFormat to better automate
>> > the process.
>> >
>> > I see it issuing the "select... from ... order by id limit..." which
>> > takes ages (several minutes) on my large tables since I use myisam and
>> > it hangs around on the "sorting result".
>> >
>> > Is there anything I should watch out for if I customise the
>> > DBInputFormat to select the max(id) in the getCount(), and use that to
>> > create ID ranges for the splits, and then issue the selects with:
>> >
>> >  select ... from ... where id between <lower> and <upper> order by id?
>> >
>> > It does mean that they won't be equal splits as there are holes in the
>> > order, and some might be empty but it is a very fast select statement.
>> >
>> > Thanks for any pointers,
>> >
>> > Tim
>> >
>
>
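
The range-based split strategy discussed in this thread can be sketched roughly as follows. This is a minimal illustration in plain Java, not the DBInputFormat or DataDrivenDBInputFormat code. It assumes a numeric `id` column whose minimum and maximum have already been fetched (for example with a min(id)/max(id) query), and that the ids are well below Long.MAX_VALUE. The class `IdRangeSplits`, its methods, and the table name `my_table` are hypothetical names made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: cuts an id interval into
// contiguous ranges and builds one bounded SELECT per range.
class IdRangeSplits {

    /** Returns inclusive {lower, upper} pairs covering minId..maxId in at most numSplits ranges. */
    static List<long[]> ranges(long minId, long maxId, int numSplits) {
        if (numSplits < 1 || maxId < minId) {
            throw new IllegalArgumentException("need numSplits >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long size = (total + numSplits - 1) / numSplits; // ceiling division
        List<long[]> out = new ArrayList<>();
        for (long lower = minId; lower <= maxId; lower += size) {
            long upper = Math.min(lower + size - 1, maxId);
            out.add(new long[] {lower, upper});
        }
        return out;
    }

    /** Builds the bounded query for one range; the ranges never overlap. */
    static String query(String columns, String table, long[] range) {
        return "SELECT " + columns + " FROM " + table
                + " WHERE id BETWEEN " + range[0] + " AND " + range[1]
                + " ORDER BY id";
    }

    public static void main(String[] args) {
        // Example: ids 1..1000 cut into 4 ranges of 250 ids each.
        for (long[] r : ranges(1, 1000, 4)) {
            System.out.println(query("*", "my_table", r));
        }
    }
}
```

As noted in the thread, ranges computed this way are equal in id width, not in row count: gaps in the id sequence make some splits smaller than others, and some may be empty. Without a count(*) per range there is also no cheap way to produce global line-number keys.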
