Return-Path: X-Original-To: apmail-drill-issues-archive@minotaur.apache.org Delivered-To: apmail-drill-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id CFE3E18CE2 for ; Mon, 22 Jun 2015 01:19:00 +0000 (UTC) Received: (qmail 23630 invoked by uid 500); 22 Jun 2015 01:19:00 -0000 Delivered-To: apmail-drill-issues-archive@drill.apache.org Received: (qmail 23603 invoked by uid 500); 22 Jun 2015 01:19:00 -0000 Mailing-List: contact issues-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list issues@drill.apache.org Received: (qmail 23593 invoked by uid 99); 22 Jun 2015 01:19:00 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 22 Jun 2015 01:19:00 +0000 Date: Mon, 22 Jun 2015 01:19:00 +0000 (UTC) From: "Aman Sinha (JIRA)" To: issues@drill.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (DRILL-3246) Query planning support for partition by clause in Drill's CTAS statement MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/DRILL-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595278#comment-14595278 ] Aman Sinha commented on DRILL-3246: ----------------------------------- [~jni] Partitioning for CTAS is supported in some MPP systems, although the syntax is not exactly the same, see: http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_AS.html (this has a 'distkey' clause instead of 'partition by'). Since 'partition by' is more closely associated with a DDL statement, it would seem natural to have it after the CREATE TABLE. I do understand the motivation for doing it after the SELECT for cases where the columns are not explicitly provided in the CREATE TABLE or with the '*' column in the SELECT list. However, we should think about potential implications when extending the SELECT syntax. I think we would have to test it more. Could we not require the CTAS to always specify the columns in CREATE TABLE ? This will ensure that aliases, arbitrary expressions in the SELECT list will continue to work. > Query planning support for partition by clause in Drill's CTAS statement > ------------------------------------------------------------------------ > > Key: DRILL-3246 > URL: https://issues.apache.org/jira/browse/DRILL-3246 > Project: Apache Drill > Issue Type: New Feature > Components: Query Planning & Optimization > Affects Versions: 1.0.0 > Reporter: Jinfeng Ni > Assignee: Jinfeng Ni > Fix For: 1.1.0 > > > We are going to add "PARTITION BY" clause in Drill's CTAS statement. The "PARTITION BY" clause will specify the list of columns out of the result table's column list that will be used to partition the data. > CREATE TABLE table_name [ (col_name, .... ) ] > [PARTITION BY (col_name, ...)] > AS SELECT_STATEMENT; > Semantics restriction for the PARTITION BY clause: > - All the columns in the PARTITION BY clause have to be in the table's column list, or the SELECT_STATEMENT has a * column, when the base table in the SELECT_STATEMENT is schema-less. Otherwise, an query validation error would be raised. > - When the partition column is resolved to * column in a schema-less query, this * column could not be a result of join operation. This restriction is added, since for * out of join operation, query planner would not know which table might produce this partition column. > Example : > {code} > create table mytable1 partition by (r_regionkey) as > select r_regionkey, r_name from cp.`tpch/region.parquet` > {code} > {code} > create table mytable2 partition by (r_regionkey) as > select * from cp.`tpch/region.parquet` > {code} > {code} > create table mytable3 partition by (r_regionkey) as > select r.r_regionkey, r.r_name, n.n_nationkey, n.n_name > from cp.`tpch/nation.parquet` n, cp.`tpch/region.parquet` r > where n.n_regionkey = r.r_regionkey > {code} > Invalid case 1: Partition column is not in table's column list. > {code} > create table mytable4 partition by (r_regionkey2) as > select r_regionkey, r_name from cp.`tpch/region.parquet` > {code} > Invalid case 2: Partition column is resolved to * out of a join operator. > {code} > create table mytable5 partition by (r_regionkey) as > select * > from cp.`tpch/nation.parquet` n, cp.`tpch/region.parquet` r > where n.n_regionkey = r.r_regionkey > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)