hadoop-mapreduce-user mailing list archives

From "Omer Trajman" <o...@vertica.com>
Subject RE: DBInputFormat - alternative select strategy?
Date Mon, 12 Oct 2009 15:01:46 GMT
For a basic extract that's a sensible approach.  You can similarly use any
column with a low number of distinct values.  One caveat: unless you run a
count(*) query for each range, you won't be able to generate line number
keys the way DBInputFormat does.
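
To illustrate the caveat: if each split is an ID range, a split can only
number its rows globally if it knows how many rows all earlier ranges hold.
A minimal sketch of that bookkeeping, assuming the per-range count(*)
results have already been fetched into an array (the class and method names
here are hypothetical, not part of Hadoop):

```java
public class RangeRowOffsets {

    /**
     * Given the row count of each ID range, in range order, returns the
     * global row number at which each range starts (a prefix sum).
     * counts[i] is assumed to come from a count(*) query over range i.
     */
    public static long[] startOffsets(long[] counts) {
        long[] offsets = new long[counts.length];
        long running = 0;
        for (int i = 0; i < counts.length; i++) {
            offsets[i] = running;   // rows held by all earlier ranges
            running += counts[i];
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical counts for three ranges; the second range is empty.
        long[] offsets = startOffsets(new long[] {3, 0, 5});
        for (long o : offsets) {
            System.out.println(o);
        }
    }
}
```

A record's key would then be its range's start offset plus its position
within the range.  Without the counts, only per-range positions are known.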


-----Original Message-----
From: tim robertson [mailto:timrobertson100@gmail.com] 
Sent: Monday, October 12, 2009 10:44 AM
To: mapreduce-user@hadoop.apache.org
Subject: DBInputFormat - alternative select strategy?

Hi all,

I've been dumping tables from MySQL and loading them manually into
HDFS, but decided to look at the DBInputFormat to better automate
the process.

I see it issuing "select ... from ... order by id limit ...", which
takes ages (several minutes) on my large tables, since I use MyISAM and
the query hangs around in the "sorting result" state.

Is there anything I should watch out for if I customise the
DBInputFormat to select the max(id) in the getCount(), and use that to
create ID ranges for the splits, and then issue the selects with:

  select ... from ... where id between <lower> and <upper> order by id?

It does mean the splits won't be equal, as there are gaps in the id
sequence, and some might be empty, but it is a very fast select statement.
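
The range idea above can be sketched as follows.  This is only an
illustration under stated assumptions, not a patch to DBInputFormat: the
class and method names are hypothetical, min(id) and max(id) are assumed
to have been queried already, and overflow for ids near Long.MAX_VALUE is
not handled.  Since BETWEEN is inclusive on both ends, each range starts
one past the previous upper bound so that no row is read twice.

```java
import java.util.ArrayList;
import java.util.List;

public class IdRangeSplits {

    /**
     * Splits the inclusive id interval [minId, maxId] into at most
     * numSplits contiguous, non-overlapping {lower, upper} ranges.
     */
    public static List<long[]> ranges(long minId, long maxId, int numSplits) {
        if (numSplits < 1 || maxId < minId) {
            throw new IllegalArgumentException(
                "need numSplits >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long chunk = (total + numSplits - 1) / numSplits; // ceiling division
        List<long[]> out = new ArrayList<>();
        for (long lower = minId; lower <= maxId; lower += chunk) {
            long upper = Math.min(lower + chunk - 1, maxId);
            out.add(new long[] {lower, upper});
        }
        return out;
    }

    /** Builds the per-split query; columns and table are caller-supplied. */
    public static String selectFor(String columns, String table, long[] range) {
        return "SELECT " + columns + " FROM " + table
            + " WHERE id BETWEEN " + range[0] + " AND " + range[1]
            + " ORDER BY id";
    }

    public static void main(String[] args) {
        // Hypothetical table and column names, for illustration only.
        for (long[] r : ranges(1, 1000, 4)) {
            System.out.println(selectFor("id, name", "my_table", r));
        }
    }
}
```

Ranges are equal in id width, not in row count, so a range that falls
entirely in a gap yields an empty split.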

Thanks for any pointers,

