hadoop-mapreduce-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
Date Wed, 26 Aug 2009 06:54:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747803#action_12747803 ]

Enis Soztutar commented on MAPREDUCE-885:
-----------------------------------------

We have two different strategies for splitting (row-driven and data-driven) and several
vendor-specific classes. However, I think the current logic in the patch, in which the
InputFormat determines the strategy and the vendor-specific code is selected automatically,
is cleaner. Moreover, since DDDBIF does not use non-standard SQL constructs, I think
we won't need any vendor-specific code other than MySQL.
One more thing before we go: could you please revert the changes in OracleDBRR? Thanks.


> More efficient SQL queries for DBInputFormat
> --------------------------------------------
>
>                 Key: MAPREDUCE-885
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.patch
>
>
> DBInputFormat generates InputSplits by counting the available rows in a table, and selecting
subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful
in an ordered context, so the query also includes an "ORDER BY" clause on an index column.
The resulting queries are often inefficient and require full table scans. Actually using multiple
mappers with these queries can lead to O(n^2) behavior in the database, where n is the number
of splits. Attempting to use parallelism with these queries is counter-productive.
> A better mechanism is to organize splits based on data values themselves, which can be
performed in the WHERE clause, allowing for index range scans of tables, and can better exploit
parallelism in the database.
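
To make the contrast between the two strategies concrete, here is a minimal sketch of the kind of query each one would issue per split. It is not the code from the attached patch: the class name, method names, and the assumption of an integer split column with known minimum and maximum values are all hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative and do not come from the patch.
class SplitQuerySketch {

    // Row-driven split: a LIMIT/OFFSET window over an ordered scan.
    // The database typically has to walk past `offset` rows to reach the window.
    static String rowDrivenQuery(String table, String orderCol, long limit, long offset) {
        return "SELECT * FROM " + table + " ORDER BY " + orderCol
                + " LIMIT " + limit + " OFFSET " + offset;
    }

    // Data-driven split: each query covers a half-open value range [lo, hi)
    // on the split column, so an index range scan can serve it directly.
    // Assumes an integer column and min <= max (assumption, not from the source).
    static List<String> dataDrivenQueries(String table, String col,
                                          long min, long max, int numSplits) {
        List<String> queries = new ArrayList<>();
        long span = max - min + 1;
        // Ceiling division so the ranges cover every value up to max.
        long step = Math.max(1, (span + numSplits - 1) / numSplits);
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(lo + step, max + 1);
            queries.add("SELECT * FROM " + table + " WHERE " + col + " >= " + lo
                    + " AND " + col + " < " + hi);
        }
        return queries;
    }
}
```

With the row-driven form, every split repeats the ordered scan up to its own offset; with the data-driven form, the ranges are disjoint and each can be answered independently from the index.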

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

