Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <1922108519.1251186479389.JavaMail.jira@brutus>
Date: Tue, 25 Aug 2009 00:47:59 -0700 (PDT)
From: "Enis Soztutar (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-885) More efficient SQL queries for
 DBInputFormat
In-Reply-To: <1403370404.1250639774810.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747277#action_12747277 ] 

Enis Soztutar commented on MAPREDUCE-885:
-----------------------------------------

Data-driven splits are really neat. Just a few suggestions:
- We can add a getSplitter(int sqlDataType) method to DDDBIF and move the SQL type -> DBSplitter instance mapping into it, so that classes extending DDDBIF can easily override this logic (for skewed data, etc.).
- Introduce DDDBRR, extending DBRR, in DDDBIF, and move getDataBasedSelectQuery() into it as an overridden implementation of getSelectQuery().
- Do we need mapred.lib.db.DDDBIF, since it is introduced as deprecated? I know that lots of legacy code is using the old API, but adding an already deprecated class seems odd.
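
To make the first suggestion concrete, here is a rough sketch of what an overridable getSplitter() could look like. Everything in it is a hypothetical stand-in rather than code from the patch: the DBSplitter interface shape, the splitter class names, the choice of SQL types, and the null fallback are all assumptions made for illustration.

```java
import java.sql.Types;

// Hypothetical stand-in for the patch's splitter abstraction (shape assumed).
interface DBSplitter {
    String describe();
}

class IntegerSplitter implements DBSplitter {
    public String describe() { return "even integer ranges"; }
}

class TextSplitter implements DBSplitter {
    public String describe() { return "string prefix ranges"; }
}

// Hypothetical splitter a subclass might supply for skewed integer keys.
class SkewedIntegerSplitter implements DBSplitter {
    public String describe() { return "skew-aware integer ranges"; }
}

// Stand-in for DDDBIF: the SQL type -> splitter mapping lives in one method.
class DataDrivenInputFormatSketch {
    /** Maps a java.sql.Types constant to a splitter; subclasses may override. */
    public DBSplitter getSplitter(int sqlDataType) {
        switch (sqlDataType) {
            case Types.TINYINT:
            case Types.SMALLINT:
            case Types.INTEGER:
            case Types.BIGINT:
                return new IntegerSplitter();
            case Types.CHAR:
            case Types.VARCHAR:
            case Types.LONGVARCHAR:
                return new TextSplitter();
            default:
                // Assumption: null means "no splitter known for this type".
                return null;
        }
    }
}

// A subclass changes the mapping for one type and defers to the parent otherwise.
class SkewAwareInputFormatSketch extends DataDrivenInputFormatSketch {
    @Override
    public DBSplitter getSplitter(int sqlDataType) {
        if (sqlDataType == Types.INTEGER) {
            return new SkewedIntegerSplitter();
        }
        return super.getSplitter(sqlDataType);
    }
}
```

With the mapping in a single method, a subclass only has to handle the types it cares about and can fall through to the default for everything else.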


> More efficient SQL queries for DBInputFormat
> --------------------------------------------
>
>                 Key: MAPREDUCE-885
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-885.patch
>
>
> DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful in an ordered context, so the query also includes an "ORDER BY" clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive.
> A better mechanism is to organize splits based on data values themselves, which can be performed in the WHERE clause, allowing for index range scans of tables, and can better exploit parallelism in the database.
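
The difference between the two approaches in the description above can be illustrated with a small sketch. The class, method names, and exact query text here are assumptions for illustration and are not taken from DBInputFormat or the attached patch; the sketch also assumes a numeric split column whose minimum and maximum are already known.

```java
// Hypothetical helper contrasting offset-based and value-based split queries.
class SplitQuerySketch {
    /** Offset-style split: the database must order the rows and skip `offset` of them. */
    static String offsetQuery(String table, String orderCol, long limit, long offset) {
        return "SELECT * FROM " + table + " ORDER BY " + orderCol
            + " LIMIT " + limit + " OFFSET " + offset;
    }

    /** Value-style split: a half-open range on the split column, usable with an index range scan. */
    static String rangeQuery(String table, String splitCol, long lo, long hi) {
        return "SELECT * FROM " + table + " WHERE " + splitCol + " >= " + lo
            + " AND " + splitCol + " < " + hi;
    }

    /**
     * Cuts the inclusive value range [min, max] into numSplits half-open ranges.
     * Returns numSplits + 1 boundaries; split i covers [b[i], b[i+1]).
     * Assumes the range is small enough that the arithmetic does not overflow.
     */
    static long[] boundaries(long min, long max, int numSplits) {
        long span = max - min + 1;
        long[] b = new long[numSplits + 1];
        for (int i = 0; i <= numSplits; i++) {
            b[i] = min + span * i / numSplits;
        }
        return b;
    }
}
```

Each mapper would then run one rangeQuery() over its own pair of adjacent boundaries, so no split depends on the database ordering and skipping rows that belong to earlier splits.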

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.