More efficient SQL queries for DBInputFormat
--------------------------------------------
Key: MAPREDUCE-885
URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Aaron Kimball
Assignee: Aaron Kimball
Attachments: MAPREDUCE-885.patch
DBInputFormat generates InputSplits by counting the available rows in a table, and selecting
subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful
in an ordered context, so the query also includes an "ORDER BY" clause on an index column.
The resulting queries are often inefficient and can require full table scans: to serve a split,
the database typically has to produce and discard every row that precedes the OFFSET. Running
these queries from multiple mappers can therefore lead to O(n^2) work in the database, where n
is the number of splits, so adding parallelism is counter-productive.
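The cost pattern described above can be illustrated with a minimal sketch. This is not the DBInputFormat source; the function names, the table and column names, and the cost model (the database reads every skipped row before returning a split) are assumptions made for illustration only.

```python
# Hypothetical sketch: none of these names come from DBInputFormat itself.

def offset_split_queries(table, order_col, row_count, num_splits):
    """Build one LIMIT/OFFSET query per split over an ordered table.

    Assumes num_splits is a positive integer.
    Returns a list of (offset, limit, sql) tuples.
    """
    chunk = -(-row_count // num_splits)  # ceiling division: rows per split
    queries = []
    for i in range(num_splits):
        offset = i * chunk
        if offset >= row_count:
            break  # fewer rows than splits; nothing left to assign
        limit = min(chunk, row_count - offset)
        sql = (f"SELECT * FROM {table} ORDER BY {order_col} "
               f"LIMIT {limit} OFFSET {offset}")
        queries.append((offset, limit, sql))
    return queries


def rows_touched(queries):
    """Rows read under the assumed cost model: each query walks past
    its OFFSET rows and then reads its LIMIT rows."""
    return sum(offset + limit for offset, limit, _ in queries)


if __name__ == "__main__":
    splits = offset_split_queries("employees", "id", 1000, 4)
    for _, _, sql in splits:
        print(sql)
    print("rows touched:", rows_touched(splits))
```

Under this model, 4 splits over a 1000-row table read 250 + 500 + 750 + 1000 = 2500 rows in total, which is 10 chunk reads instead of 4. In general n splits cost n(n+1)/2 chunk reads, which is the quadratic growth in the number of splits.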
A better mechanism is to organize splits by the data values themselves, expressed as ranges in
the WHERE clause. This allows index range scans of the table and better exploits parallelism
in the database.
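A value-based split of this kind might look like the sketch below. It assumes an integer split column whose minimum and maximum are already known (for example from a SELECT MIN(col), MAX(col) query); the function name and all table and column names are hypothetical, and this is not the code in the attached patch.

```python
# Hypothetical sketch: illustrates splitting by value ranges, not the patch.

def range_split_queries(table, split_col, lo, hi, num_splits):
    """Build one WHERE-bounded query per split over integer values lo..hi.

    Assumes lo <= hi and num_splits is a positive integer. Each query
    covers a half-open range [start, end), so no value is read twice.
    """
    span = hi - lo + 1                 # number of distinct candidate values
    chunk = -(-span // num_splits)     # ceiling division: values per split
    queries = []
    start = lo
    while start <= hi:
        end = min(start + chunk, hi + 1)  # exclusive upper bound
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {split_col} >= {start} AND {split_col} < {end}"
        )
        start = end
    return queries


if __name__ == "__main__":
    for sql in range_split_queries("employees", "id", 1, 1000, 4):
        print(sql)
```

Each query here can be answered by a range scan on an index over the split column, so no split has to skip rows that belong to another. The ranges are even in value space, not in row count, so skewed data would produce uneven splits.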