impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dimitris Tsirogiannis (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
Date Mon, 10 Oct 2016 15:50:54 GMT
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables

Patch Set 3:

File common/thrift/CatalogObjects.thrift:

PS3, Line 344: // Enum listing all possible DISTRIBUTE BY types
             : enum TDistributeType {
             :   HASH,
             :   RANGE,
> I don't see this used
Correct. Removed.

PS3, Line 350: // Parameters needed for hash distribution
             : struct TDistributeByHashParam {
             :   1: required list<string> columns
             :   2: required i32 num_buckets
             : }
             : struct TRangeLiteral {
             :   1: optional i64 int_literal
             :   2: optional string string_literal
             : }
             : struct TRangeLiteralList {
             :   1: required list<TRangeLiteral> values
             : }
             : // A range distribution is identified by a list of columns and a series of
split rows.
             : struct TDistributeByRangeParam {
             :   1: required list<string> columns
             :   2: optional list<TRangeLiteralList> split_rows;
             : }
             : // Parameters for the DISTRIBUTE BY clause. The actual distribution is identified
             : // the type parameter.
             : struct TDistributeParam {
             :   // Set if type is set to HASH
             :   1: optional TDistributeByHashParam by_hash_param;
             :   // Set if type is set to RANGE
             :   2: optional TDistributeByRangeParam by_range_param;
             : }
> I see that these are used in the serialized catalog objects, but given that
We store the distribute by params in the KuduTable class which is then serialized and sent
to al the impalads. So, I think we do need them in this case, unless I misunderstood your

PS3, Line 390: 
             :   // Distribution schemes
             :   4: required list<TDistributeParam> distribute_by
> same for this, can we get rid of it?
See my comment above. We do load and serialize the distribute by params in the KuduTable class.
File common/thrift/JniCatalog.thrift:

PS3, Line 398: CatalogObjects.TDistributeParam
> wrt my comments in CatalogObjects, I guess we'd need them here, but I don't
See my responses to your comments and let me know if it's still not clear.
File fe/src/main/java/org/apache/impala/analysis/

PS3, Line 21:  * Placeholder for the column definition options of a CREATE TABLE statement.
            :  * Contains the list of column definitions and, optionally, the list of column
            :  * specified using the PRIMARY KEY keyword.
> this feels like a weird abstraction... do we really need this class?
This was used only during parsing to store the "(col type, col type, PRIMARY KEY (col, ..))"
part of the create table stmt. I modified the parser so that this isn't needed anymore. Done
File fe/src/main/java/org/apache/impala/analysis/

Line 87:   InsertStmt getInsertStmt() { return insertStmt_; }
> why change visibility?
File fe/src/main/java/org/apache/impala/analysis/

PS3, Line 344: getTblPrimaryKeyColumnNames(),  null,
> Why would there be PK col names?

PS3, Line 344:   
> nit extra space
File fe/src/main/java/org/apache/impala/analysis/

PS3, Line 167:     if (getFileFormat() != THdfsFileFormat.KUDU && KuduTable.KUDU_STORAGE_HANDLER.equals(
             :         getTblProperties().get(KuduTable.KEY_STORAGE_HANDLER))) {
             :       throw new AnalysisException(KUDU_STORAGE_HANDLER_ERROR_MESSAGE);
             :     }
             :     // Avro tables can have empty column defs because they can infer them from
the Avro
             :     // schema. Likewise for external Kudu tables, the schema can be read from
             :     if (getColumnDefs().isEmpty() && getFileFormat() != THdfsFileFormat.AVRO
             :         && getFileFormat() != THdfsFileFormat.KUDU) {
             :       throw new AnalysisException("Table requires at least 1 column");
             :     }
             :     if (getFileFormat() == THdfsFileFormat.KUDU) {
             :       analyzeKuduFormat(analyzer);
             :     } else {
             :       AnalysisUtils.throwIfNotNullOrNotEmpty(getDistributeParams(),
             :           "Only Kudu tables can use DISTRIBUTE BY clause.");
             :       throwIfPrimaryKeysWereSpecified("Only Kudu tables can specify a PRIMARY
             :     }
             :     if (getFileFormat() == THdfsFileFormat.AVRO) {
             :       setColumnDefs(analyzeAvroSchema(analyzer));
             :       if (getColumnDefs().isEmpty()) {
             :         throw new AnalysisException(
             :             "An Avro table requires column definitions or an Avro schema.");
             :       }
             :       AvroSchemaUtils.setFromSerdeComment(getColumnDefs());
             :     }
             :   }
> what if we were to move everything Kudu related here into analyzeKuduFormat
Done. Let me know if that looks any better now.

PS3, Line 279:         throw new AnalysisException(String.format("Column '%s' (type: '%s')
cannot " +
             :             "be used as a primary key. Keys must be integer or string types.",
             :             pkColDef.getColName(), pkColDef.getType().toSql()));
> I don't see a test case for this error
There are a couple in (L1738, L1743).

PS3, Line 277:       ColumnDef pkColDef = getPrimaryKeyColumnDefs().get(i);
             :       if (!KuduUtil.isSupportedKeyType(pkColDef.getType())) {
             :         throw new AnalysisException(String.format("Column '%s' (type: '%s')
cannot " +
             :             "be used as a primary key. Keys must be integer or string types.",
             :             pkColDef.getColName(), pkColDef.getType().toSql()));
             :       }
             :       if (!pkColDef.equals(getColumnDefs().get(i))) {
             :         throw new AnalysisException(String.format("Primary key columns in Kudu
" +
             :             "tables must be declared first and in-order. Key column '%s' is
" +
             :             "out-of-order.", pkColDef.getColName()));
             :       }
> If we weren't to do these checks (i.e. that they're supported types and tha

PS3, Line 315:    * - Only primary key column can be used in distribution definitions
             :    * - Distributions may be either hash or ranged (value intervals)
             :    * - Multiple distributions can be used/mixed
             :    * - A column can only be used in one hash distribution
             :    * - A range distribution can only be used once and must be the last definition
             :    *   (both enforced by the grammar).
> What Kudu supports for partitioning is a bit tricky when we get into nested

PS3, Line 358: throwIfPrimaryKeysWereSpecified
> I do prefer to change this to return a bool and have callers throw: bool ha
File fe/src/main/java/org/apache/impala/analysis/

PS3, Line 156:       if (splitRows_ == null) {
             :         builder.append("...");
> can this happen? wouldn't l115 throw?
This is possible because an object of DistributeParam class is created in two cases: 
1. During the analysis of a create table stmt for Kudu. In this case, split rows for range
partitioning are guaranteed to be set so L115 is ok. 
2. When we load the schema and distribution schemes in the catalog for a Kudu table. In this
case, Kudu's API doesn't currently return any information about the split points so this field
could be null. 

With the advent of proper range partitioning syntax this should go away. Kudu will have to
provide information about the range partitions including split points or ranges. Would it
be ok to add a TODO and reference the pending JIRA? Or let me know if you have another preference.

PS3, Line 191:       if (splitRows_ == null) {
             :         result.setBy_range_param(rangeParam);
             :         return result;
             :       }
> this is possible? what does it mean?
See explanation above.
File fe/src/main/java/org/apache/impala/analysis/

PS3, Line 24: class TableDataLayout {
            :   private final List<ColumnDef> partitionColDefs_;
            :   private final List<DistributeParam> distributeParams_;
> Not that we're doing this in this review, but we need to think about how th
Agreed. Left a TODO.

PS3, Line 49:   List<ColumnDef> getPartitionColumnDefs() { return partitionColDefs_;
            :   List<DistributeParam> getDistributeParams() { return distributeParams_;
> any reason these aren't public?
No need to be public. The default access method gives public access to classes in the same
package and that's all we need here. Let me know if it is confusing.
File fe/src/main/java/org/apache/impala/analysis/

PS3, Line 1: // Copyright 2016 Cloudera Inc.
           : //
           : // Licensed under the Apache License, Version 2.0 (the "License");
           : // you may not use this file except in compliance with the License.
           : // You may obtain a copy of the License at
           : //
           : //
           : //
           : // Unless required by applicable law or agreed to in writing, software
           : // distributed under the License is distributed on an "AS IS" BASIS,
           : // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           : // See the License for the specific language governing permissions and
           : // limitations under the License.
> this and maybe others need the new license header:
arrrgh. Thanks. Done

PS3, Line 126: return fullyQualifiedTableName_ != null ? fullyQualifiedTableName_ : tableName_;
> Do we need to support calling this before analysis? It'd be nice if we just
We do call this function before tableName_ has been analyzed. One example is during the parsing
of CREATE TABLE LIKE  statements where we access the specified table name from a TableDef

PS3, Line 137:   List<DistributeParam> getDistributeParams() {
             :     return dataLayout_.getDistributeParams();
             :   }
> 1line?
1 char off :)
File fe/src/main/java/org/apache/impala/catalog/

PS3, Line 258: Creates a tmp Kudu table. The table is not added to the Kudu storage engine
or the
             :    * HMS.
> I wasn't sure what this meant. How about:

PS3, Line 259: primarily
> remove
File infra/python/deps/requirements.txt:

PS3, Line 83: The kudu-python
            : # version in download_requirements must be kept in sync with this version.
> gotta update download_requirements as well

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Matthew Jacobs <>
Gerrit-HasComments: Yes

View raw message