hadoop-common-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: DBOutputFormat Speed Issues
Date Sun, 31 Jan 2010 22:03:01 GMT

I'm afraid that right now DBOutputFormat is the only available OutputFormat
for JDBC. You'll note that it doesn't really include much support for
special-casing to MySQL or other targets.

Your best bet is probably to copy the code from DBOutputFormat and
DBConfiguration into some other class (e.g. MySQLDBOutputFormat) and modify
the code in the RecordWriter to generate PreparedStatements containing
batched insert statements.
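
To make the batching idea concrete, here is a minimal sketch of the SQL-building part only, not a drop-in RecordWriter. The class name BatchedInsertSql and its methods are hypothetical (nothing like this exists in Hadoop), and the table and column names are made up for illustration. It builds one multi-row INSERT with placeholders for a given number of rows, and shows how a buffered row/column pair would map to a PreparedStatement parameter index.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop: builds a MySQL-style multi-row INSERT.
public class BatchedInsertSql {

    // Returns e.g. "INSERT INTO t (a,b) VALUES (?,?),(?,?)" for rowCount = 2.
    public static String build(String table, List<String> columns, int rowCount) {
        if (columns.isEmpty() || rowCount < 1) {
            throw new IllegalArgumentException("need at least one column and one row");
        }
        // One "(?,?,...)" tuple with a placeholder per column.
        String tuple = "(" + String.join(",", Collections.nCopies(columns.size(), "?")) + ")";
        // Repeat the tuple once per buffered row.
        return "INSERT INTO " + table + " (" + String.join(",", columns) + ") VALUES "
                + String.join(",", Collections.nCopies(rowCount, tuple));
    }

    // 1-based JDBC parameter index for a 0-based row and column in the batch.
    public static int paramIndex(int row, int col, int numColumns) {
        return row * numColumns + col + 1;
    }

    public static void main(String[] args) {
        // Illustrative table and column names only.
        System.out.println(build("results", Arrays.asList("id", "payload"), 3));
    }
}
```

In a modified RecordWriter, write() would presumably buffer rows until some batch size is reached, then prepare a statement from build(...), bind each value at paramIndex(row, col, numColumns), and execute it. close() would need to flush the final, possibly smaller, batch with a statement built for that row count.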

If you arrive at a solution which is pretty general-purpose/robust, please
consider contributing it back to the Hadoop project :) If you do so, send me
an email off-list; I'm happy to help with advice on developing better DB
integration code, reviewing your work, etc.

Also on the input side, you should really be using DataDrivenDBInputFormat
instead of the older DBInputFormat :) Sqoop (in src/contrib/sqoop on Apache
0.21 / CDH 0.20) has pretty good support for parallel imports, and is built
on DataDrivenDBInputFormat.
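
Roughly, the data-driven approach gives each map task its own range condition on a splitting column, rather than paging through one large result set. The sketch below only illustrates that idea for an integer column. It is not the actual Hadoop code, and the class name RangeSplitSketch and its method are hypothetical. It assumes the column's minimum and maximum values are already known and that the range is small enough not to overflow a long.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of range-based splitting, not the Hadoop implementation.
public class RangeSplitSketch {

    // Splits [min, max] into at most numSplits contiguous ranges and returns
    // one WHERE-clause condition per range.
    public static List<String> splitConditions(String column, long min, long max, int numSplits) {
        if (numSplits < 1 || max < min) {
            throw new IllegalArgumentException("need numSplits >= 1 and max >= min");
        }
        List<String> conditions = new ArrayList<>();
        long span = max - min + 1;
        // Ceiling division so the ranges cover every value in [min, max].
        long step = Math.max(1, (span + numSplits - 1) / numSplits);
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(lo + step - 1, max);
            conditions.add(column + " >= " + lo + " AND " + column + " <= " + hi);
        }
        return conditions;
    }

    public static void main(String[] args) {
        // Illustrative column name and bounds only.
        for (String c : splitConditions("id", 1, 10, 3)) {
            System.out.println(c);
        }
    }
}
```

Each condition would then be appended to that task's SELECT, so the tasks read disjoint slices of the table in parallel.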

- Aaron

On Thu, Jan 28, 2010 at 11:39 AM, Nick Jones <nick.jones@amd.com> wrote:

> Hi all,
> I have a use case for collecting several rows from MySQL of
> compressed/unstructured data (n rows), expanding the data set, and storing
> the expanded results back into a MySQL DB (100,000n rows). DBInputFormat
> seems to perform reasonably well but DBOutputFormat is inserting rows
> one-by-one.  How can I take advantage of MySQL's support of generating fewer
> insert statements with more values within each one?
> Thanks.
> --
> Nick Jones
