Return-Path: Delivered-To: apmail-hadoop-mapreduce-dev-archive@minotaur.apache.org Received: (qmail 42904 invoked from network); 18 Aug 2009 23:56:18 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 18 Aug 2009 23:56:18 -0000 Received: (qmail 23542 invoked by uid 500); 18 Aug 2009 23:56:37 -0000 Delivered-To: apmail-hadoop-mapreduce-dev-archive@hadoop.apache.org Received: (qmail 23450 invoked by uid 500); 18 Aug 2009 23:56:36 -0000 Mailing-List: contact mapreduce-dev-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mapreduce-dev@hadoop.apache.org Delivered-To: mailing list mapreduce-dev@hadoop.apache.org Received: (qmail 23440 invoked by uid 99); 18 Aug 2009 23:56:36 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 18 Aug 2009 23:56:36 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 18 Aug 2009 23:56:34 +0000 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id CED6129A0011 for ; Tue, 18 Aug 2009 16:56:14 -0700 (PDT) Message-ID: <1906410610.1250639774845.JavaMail.jira@brutus> Date: Tue, 18 Aug 2009 16:56:14 -0700 (PDT) From: "Aaron Kimball (JIRA)" To: mapreduce-dev@hadoop.apache.org Subject: [jira] Created: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org More efficient SQL queries for DBInputFormat -------------------------------------------- Key: MAPREDUCE-885 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Aaron Kimball Assignee: Aaron Kimball Attachments: MAPREDUCE-885.patch DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful in an ordered context, so the query also includes an "ORDER BY" clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive. A better mechanism is to organize splits based on data values themselves, which can be performed in the WHERE clause, allowing for index range scans of tables, and can better exploit parallelism in the database. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.