Message-ID: <1094300786.1252478517780.JavaMail.jira@brutus>
Date: Tue, 8 Sep 2009 23:41:57 -0700 (PDT)
From: "Hadoop QA (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat
In-Reply-To: <1403370404.1250639774810.JavaMail.jira@brutus>
Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm

[
https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752937#action_12752937 ]

Hadoop QA commented on MAPREDUCE-885:
-------------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12419012/MAPREDUCE-885.6.patch
  against trunk revision 812546.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 4 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/20/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/20/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/20/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/20/console

This message is automatically generated.

> More efficient SQL queries for DBInputFormat
> --------------------------------------------
>
> Key: MAPREDUCE-885
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Aaron Kimball
> Assignee: Aaron Kimball
> Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, MAPREDUCE-885.4.patch, MAPREDUCE-885.5.patch, MAPREDUCE-885.6.patch, MAPREDUCE-885.patch
>
>
> DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful in an ordered context, so the query also includes an "ORDER BY" clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive.
> A better mechanism is to organize splits based on data values themselves, which can be performed in the WHERE clause, allowing for index range scans of tables, and can better exploit parallelism in the database.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
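
For readers of the archive, the two split strategies contrasted in the quoted issue description can be sketched as query generators. This is an illustrative sketch only, not the code in the attached patches: the function names, the `employees` table, and the `id` column are hypothetical, and the range variant assumes an integer split column whose minimum and maximum are already known (for example from a separate MIN/MAX query).

```python
def offset_split_queries(table, order_col, row_count, num_splits):
    """Build LIMIT/OFFSET queries, one per split (the approach the issue criticizes).

    Each query must be ordered, and the database typically has to walk past
    all rows before OFFSET, so later splits repeat the work of earlier ones.
    """
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    chunk = -(-row_count // num_splits)  # ceiling division
    queries = []
    for i in range(num_splits):
        offset = i * chunk
        if offset >= row_count:
            break
        limit = min(chunk, row_count - offset)
        queries.append(
            f"SELECT * FROM {table} ORDER BY {order_col} "
            f"LIMIT {limit} OFFSET {offset}"
        )
    return queries


def range_split_queries(table, split_col, min_val, max_val, num_splits):
    """Build WHERE-range queries over the integer interval [min_val, max_val].

    Each split is a half-open range [lo, hi) on the split column, which an
    index on that column can serve with a range scan.
    """
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    span = max_val - min_val + 1
    chunk = -(-span // num_splits)  # ceiling division
    queries = []
    lo = min_val
    while lo <= max_val:
        hi = min(lo + chunk, max_val + 1)
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {split_col} >= {lo} AND {split_col} < {hi}"
        )
        lo = hi
    return queries


if __name__ == "__main__":
    # Hypothetical table with 10 rows and ids 1..10, divided into 3 splits.
    for q in offset_split_queries("employees", "id", 10, 3):
        print(q)
    for q in range_split_queries("employees", "id", 1, 10, 3):
        print(q)
```

Note that range splits divide the value space rather than the row count, so a skewed distribution of `id` values would yield uneven splits; the sketch makes no attempt to balance them.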