hadoop-mapreduce-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
Date Tue, 25 Aug 2009 17:51:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747519#action_12747519 ]

Aaron Kimball commented on MAPREDUCE-885:
-----------------------------------------

Enis,

* +1 to getSplitter().
* The reason for not adding DDDBRR (these acronyms are getting to be a mouthful!) is that
I wanted to take advantage of the database-specific RR factory code in DBIF, and the existing
family of db-specific RRs. Otherwise I'll need to add a new MySQLDDDBRR, OracleDDDBRR, etc.,
and any future vendor-specific improvements will require more code duplication to provide
both DBRR and DDDBRR compatibility. (It's times like this I wish we had C++-style multiple
inheritance.) I suppose a separate DDDBRR is technically "cleaner", but at a cost of many more
lines of code to maintain; any change made to one DBRR has to be made to the other as well.
What do you think?
* We probably don't need the deprecated version. Sqoop is still using the old API, so
I reflexively added this because I need it. I suppose the correct thing to do is to upgrade
Sqoop to the new API already. :smile:


> More efficient SQL queries for DBInputFormat
> --------------------------------------------
>
>                 Key: MAPREDUCE-885
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-885.patch
>
>
> DBInputFormat generates InputSplits by counting the available rows in a table, and selecting
> subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful
> in an ordered context, so the query also includes an "ORDER BY" clause on an index column.
> The resulting queries are often inefficient and require full table scans. Actually using multiple
> mappers with these queries can lead to O(n^2) behavior in the database, where n is the number
> of splits. Attempting to use parallelism with these queries is counter-productive.
> A better mechanism is to organize splits based on data values themselves, which can be
> performed in the WHERE clause, allowing for index range scans of tables, and can better exploit
> parallelism in the database.
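
To make the contrast in the issue description concrete, here is a minimal sketch of the two query shapes. It is not the code in MAPREDUCE-885.patch: the function names, the `employees` table, and the `id` column are hypothetical, and the range version assumes an integer split column whose minimum and maximum values are already known.

```python
def offset_split_queries(table, order_col, row_count, num_splits):
    """LIMIT/OFFSET style: every split sorts the table and skips earlier rows."""
    chunk = -(-row_count // num_splits)  # ceiling division
    return [
        f"SELECT * FROM {table} ORDER BY {order_col} "
        f"LIMIT {chunk} OFFSET {i * chunk}"
        for i in range(num_splits)
    ]


def range_split_queries(table, split_col, min_val, max_val, num_splits):
    """Value-range style: every split is a half-open WHERE range on split_col."""
    span = max_val - min_val + 1
    step = -(-span // num_splits)  # ceiling division
    queries = []
    lo = min_val
    while lo <= max_val:
        hi = min(lo + step, max_val + 1)  # exclusive upper bound
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {split_col} >= {lo} AND {split_col} < {hi}"
        )
        lo = hi
    return queries


# Hypothetical table and column names, for illustration only.
for q in range_split_queries("employees", "id", 1, 100, 4):
    print(q)
```

In the first form, each split's query has to order the rows and step past everything before its offset, which is where the repeated scanning comes from. In the second form, each split touches only its own range of values, so an index on the split column can serve it directly. Evenly sized value ranges do not guarantee evenly sized splits if the values are skewed.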

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

