impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dimitris Tsirogiannis (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
Date Tue, 18 Oct 2016 07:14:13 GMT
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables

Patch Set 7:


Still a few minor asks to go through (e.g. add JIRA numbers and few clarifications) but I
am sending the next patch.
File be/src/service/

Line 62:     "value should be a comma separated list of hostnames or IP addresses.");
> are ports optional or mandatory?
Ports are optional.
File common/thrift/CatalogObjects.thrift:

Line 358:   1: required list<TRangeLiteral> values
> Why not a list of TExpr that are expected to be literals? Seems more future
Hm, TExpr? That sounds a bit too much for simple literal values unless we envision using exprs,
udfs, etc. Thoughts?

Line 368: // the type parameter.
> which type parameter?
Sorry, there was an enum that got removed and I forgot to update the comments. Done
File common/thrift/JniCatalog.thrift:

Line 398:   14: optional list<CatalogObjects.TDistributeParam> distribute_by;
> for consistency let's remove trailing ";"
File fe/src/main/cup/sql-parser.cup:

Line 1033: // class doesn't inherit from CreateTableStmt. 
> whitespace

Line 1047: // Used for creating external Kudu tables for which the schema is loaded from Kudu.
> There seem to be more uses of this production, so this comment could be mis

Line 1112: // or one RANGE clause
> typo: clauses
File fe/src/main/java/org/apache/impala/analysis/

Line 93:   void setIsPrimaryKey() { isPrimaryKey_ = true; }
> do we need this?
No, removed.

Line 191:       Preconditions.checkState(!colDefsByColName.containsKey(colDef.getColName()));
> can check return value of put()
Good point. Keep forgetting this. Done
File fe/src/main/java/org/apache/impala/analysis/

Line 208:    * Analyzes and checks table properties which are common for both managed and
> typo: common to

Line 255:         "PARTITIONED BY cannot be used with an Kudu table.");
> typo: a Kudu table

Line 273:     AnalysisUtils.throwIfNullOrEmpty(getPrimaryKeyColumnDefs(),
> Shouldn't this check hasPrimaryKey()?

Line 284:             "zero. Given number of replicas is: " + r.toString() + ".'");
> remove trailing .' or add the opening single-quote

Line 318:   private boolean hasPrimaryKey() {
> Isn't it enough to check primaryKeyColDefs_ in tableDef_?
You're right, we can do that since tableDef_ is analyzed.
File fe/src/main/java/org/apache/impala/analysis/

Line 105:     for (String colName: colNames_) {
> we could specify the same distribution column multiple times
Not sure I follow. Are we doing something to prevent this?

Line 127:             throw new AnalysisException("Split values cannot be NULL");
> do we have a test for this?
Hm no, added one. Thanks

Line 223:     colNames_.addAll(colNames);
> do we need toLower()?
We've already lowered the column names before the call to setColumnNames().
File fe/src/main/java/org/apache/impala/analysis/

Line 69:   // Populated during analysis.
> Authoritative list of primary key column definitions populated during analy

Line 177:     fqTableName_.analyze();
> Do you know if Kudu has more permissive or more restrictive constraints on 
Agreed. Will do.

Line 181:     if (analyzer.dbContainsTable(getTblName().getDb(), getTbl(), Privilege.CREATE)
> Are we going to check Sentry privs for Kudu tables? Also ok to defer this f
It's still being discussed :). Lineage as well.

Line 220:    * Analyzes the primary key columns. Primary keys are only supported for Kudu
> Replace the second sentence with a brief description what this checks. It d

Line 234:       StringBuilder columnDefStr = new StringBuilder();
> Not used?
File fe/src/main/java/org/apache/impala/service/

Line 1117:     if (db != null && params.cascade) dropTablesFromKudu(db);
> I think it might be a good idea to do this under the metastoreDdlLock_ as w
File fe/src/main/java/org/apache/impala/service/

Line 47: /**
> newline before

Line 231:     } catch (Exception e) {
> Did you address my comment more nuanced checking here to distinguish connec
I don't think we can differentiate these two given the way the Kudu client works (i.e. it
doesn't give you the ability to check if it is successfully connected to the master). I added
'e' in the construction of ImpalaRuntimeException.
File fe/src/test/java/org/apache/impala/analysis/

Line 1350:     // CTAS in an external Kudu table
> What's the rationale for not allowing this?
Hm, what does it really mean to allow CTAS on external Kudu table? We have the Kudu table
schema and the inferred schema from the select stmt. Do you think we should try to see if
they match and then create the table and perform the insert? It felt kind of weird to allow
something like this. If something fails the user will have to get the schema from Kudu and
figure out what the insert should look like. Instead the user can 1. create an external table,
2. describe to see the table schema, 3. do an insert select. Thoughts?

Line 1716:     AnalyzesOk("create table tab (x int, y string, primary key(x, y)) " +
> Can you add a brief comment explaining the table layout for this one?
Done. Let me know if that's better.

Line 1717:         "distribute by hash(x) into 3 buckets, range(y) split rows " +
> can we distribute by range, hash?
No, range needs to be the last one. Fails during parsing. Added another parser test.

Line 1723:     AnalyzesOk("create table tab (a int, b int, c int, d int, primary key (a, b,
c))" +
> What happens if I specify PARTITION BY for a Kudu table?
You get an error but the error was saying that no DISTRIBUTE BY is used. Fixed it to say that
PARTITIONED BY is not allowed for Kudu tables. Also added test. Done

Line 1739:     AnalysisError(String.format("create table tab (x int) distribute by hash (x)
" +
> What about specifying the other params in tblproperties like the distribute
Well these were never specified in TBLPROPERTIES so I am not sure if we should be checking
for any weird behavior wrt to properties set there. Besides they will be ignored and/or an
error message will be thrown if some crucial clause is missing. Let me know if you have some
specific example in mind.

Line 1748:     AnalysisError("create table tab (x int primary key, primary key(x)) stored
" +
> Add negative test where two ColumnDefs are declared as PK.

Line 1762:     AnalysisError("create table tab (a int, b int, c int, d int, primary key(a,
b, c)) " +
> Do we have tests somewhere for checking which types are supported with Kudu
We do in kudu_create.test and in Added a few more in kudu_create.test. Let me
know if there is anything else missing.

Line 1772:     // No float split keys
> Even if the column type is float? If so, might want to test that.
float is not allowed as a pk column. We have test for that in kudu_create.test.

Line 1810:     // DISTRIBUTE BY is required for managed tables.
> Primary key is also required, do we have a test?
Ha, interestingly not :). Done

Line 1812:         "Table partitioning must be specified for managed Kudu tables.");
> Let's rephrase this as "Table distribution must be specified ..."

PS6, Line 1828: AnalysisError("create external table t tblproperties (" +
> Not that it makes any sense, but hat happens with:
That results in parsing error. Let me know if you want me to add a test in ParserTest.

Line 1863:   public void TestCreateAvroTest() {
> Add a test with some of the Kudu-specific clauses and STORED AS AVRO

Line 2822:   public void TestShowFiles() throws AnalysisException {
> DO we have a TODO/JIRA somewhere to go through all DDL/SHOW commands and ma
We have some tests for SHOW CREATE TABLE in but we need a JIRA for show partitions.
File fe/src/test/java/org/apache/impala/analysis/

Line 1632
> Why remove this? It will break my setup :)
Sorry, I had issues with this on when running parser tests only. Done

> Add case like this:

Line 2541:     ParsesOk("CREATE TABLE Foo TBLPROPERTIES ('a'='b', 'c'='d') AS SELECT * from
> nit: let's be consistent with how we chop up string literals across lines (
File fe/src/test/java/org/apache/impala/catalog/

Line 58:         "org.apache.impala.hive.serde.ParquetInputFormat",
> revert
File testdata/bin/

Line 203:   if row_format and file_format == 'text':
> I think row_format is also valid for sequence files (maybe we're not using 
Hm, I couldn't find anything in the docs or in our tests.
File testdata/datasets/functional/functional_schema_template.sql:

Line 823: distribute by range(id) split rows ((1003), (1007)) stored as kudu;
> ah so much better
File testdata/workloads/functional-query/queries/QueryTest/create-table-like-table.test:

Line 218: create table kudu_table_clone like functional_kudu.alltypes
> Why can't we check this in analysis?
File testdata/workloads/functional-query/queries/QueryTest/kudu_create.test:

Line 3: # This test file contains several cases for what basically amount to analysis errors,
> Should mention that some of this behavior is intentional and why.

Line 11: ImpalaRuntimeException: Type TIMESTAMP is not supported in Kudu
> Do we get the same message for DECIMAL and complex types? (No need to add a
Yes, added tests. Done

Line 14: create table t primary key (id) distribute by hash (id) into 3 buckets
> Add test for creating and querying a table that has Impala keywords as the 
Yeah you were right about using getIdentSql(). This doesn't work as expected. Fixed it and
added a test.
File tests/common/

PS6, Line 47:     super(KuduTestSuite, cls).setup_class()
            :     if os.environ["KUDU_IS_SUPPORTED"] == "false":
            :       pytest.skip("Kudu is not supported")
> Maybe move L47 to L50 (switch the ordering of these)? This isn't just a nit

PS6, Line 51:   @classmethod
            :   def teardown_class(cls):
            :     pass
> Not needed:
File tests/custom_cluster/

PS6, Line 20: from kudu.client import INT32
> ImportError: I think this needs to be kudu.schema

PS6, Line 34:   @classmethod
            :   def add_test_dimensions(cls):
            :     super(CustomClusterTestSuite, cls).add_test_dimensions()
            :     cls.TestMatrix.add_constraint(lambda v:
            :         v.get_value('table_format').file_format == 'parquet' and
            :         v.get_value('table_format').compression_codec == 'none')
> I don't think this is needed: no tests here use a vector.

PS6, Line 43:   def test_kudu_master_hosts(self, cursor, kudu):
> kudu_client, not kudu. Note this isn't just a rename; it's to make use of t

PS6, Line 45:     with self.temp_kudu_table(kudu, [INT32]) as kudu_table:
> kudu_client

PS6, Line 50:             props)
> SyntaxError: missing a closing paren here
File tests/query_test/

PS6, Line 27: from pprint import pprint
> Remove: unused import

PS6, Line 30: from copy import copy
            : from tests.beeswax.impala_beeswax import ImpalaBeeswaxException
            : from tests.common.impala_test_suite import ImpalaTestSuite
            : from tests.common.skip import SkipIf
            : from tests.common.test_dimensions import create_uncompressed_text_dimension
            : from tests.common.test_vector import TestDimension
> All imported but not used.

PS6, Line 49:   def test_kudu_scan_node(self, vector):
> This test is racy in exhaustive: two tests will be competing for the same d
Thanks. Done

Line 57:   def test_insert_update_delete(self, vector, unique_database):
> Can we keep the .test file name and the python test function consistent? i.
File tests/query_test/

Line 192:        'show_create_sql' can be templates that can be used with str.format(). format()
> In our other SHOW CREATE TABLE tests we also try to run execute the output 
See comment below for SPLIT ROWS. Unfortunately, the output is not always executable. To be
fixed in Kudu.

Line 204:         CREATE TABLE {table} (c INT PRIMARY KEY)
> add comment to the PK column
That's interesting. Actually, currently don't work because we can't persist them in Kudu and
the first time the schema is loaded from Kudu we override the HMS column defs, so the comments
are lost. So, option 1 is to store them in tblproperies and option 2 to ignore and/or throw
error if specified for Kudu tables. Thoughts?

Line 213:         TBLPROPERTIES ('kudu.master_addresses'='{kudu_addr}')""".format(
> Why include these TBLPROPERTIES?
Good question. If the user doesn't specify the kudu master_addresses we populate tblproperties
with the known values. The issue is that we can't tell if the user added this property or
it was generated. I chose to always display this property but I can change it if you don't
like it.

> is the SPLIT ROWS (...) legal syntax?
No, but Kudu's API doesn't currently return any information about range partitions except
the column names used.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Matthew Jacobs <>
Gerrit-Reviewer: Michael Brown <>
Gerrit-HasComments: Yes

View raw message