impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dimitris Tsirogiannis (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
Date Tue, 27 Sep 2016 21:38:23 GMT
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables

Patch Set 1:

Commit Message:

PS1, Line 21: When attempting to create an external
            :   table "foo" in database "bar", Impala will search for a Kudu table
            :   name "" and "bar" (Kudu doesn't have database name spaces
            :   yet.)
> I think this sentence is duplicated by #10. Maybe remove this part. This it

PS1, Line 25:  The Kudu table is now required to exist at the time of creation in
            :   Impala.
> for external tables? or if this is more general can you explain a bit more?

PS1, Line 31: default value
> can it still be overridden?

PS1, Line 39: Eventually Hive should have the needed
            :   functionality and the Kudu delegate (renamed in this patch to KuduCatalogOpExecutor)
            :   can be removed.
> maybe, it's not clear if impala would want to keep the ability to do this o

PS1, Line 54:  If the database is "default" then
            :   the Kudu table name will not include the database.
> hm... I wonder if it'd be easier just to keep it standardized and always ha

PS1, Line 56: 11) Several improvements in the grammar related to the family
            :   of CREATE TABLE statements.
> I don't think this bullet adds much value to the commit message. Everything

PS1, Line 58: 12) Added new tests and modified existing Kudu test to use the new
            :   CREATE TABLE syntax.
> I think this can be removed too, it should be implied there are tests and i

PS1, Line 60: insteads
> instead

PS1, Line 60: source of truth
> Does this mean we always hit Kudu for metadata? In the Catalog? In the impa

PS1, Line 64: Not included in this commit:
            : - Additional column properties such as nullability and compression
            : encodings.
> I don't think you need to include this, we have a separate JIRA for this.
File be/src/catalog/

PS1, Line 47:     {"<init>", "(ZILjava/lang/String;IIZLjava/lang/String;)V",
            :       &catalog_ctor_},
> 1 line
File be/src/service/

PS1, Line 37: // XXX: This flag doesn't seem to be used anywhere. Maybe remove it?
            : DEFINE_bool(load_catalog_at_startup, false, "if true, load all catalog data
at startup");
> agreed. doesn't look like it's used by JniFrontend. Let's remove it.

PS1, Line 63: IPs
> IP addresses.

PS1, Line 63:  
> no space
File common/thrift/CatalogObjects.thrift:

PS1, Line 52: THdfsFileFormat
> Maybe not for this patch since it's already huge, but it'd be great to gene
I agree we need to revisit this. Left a TODO for now.

PS1, Line 61: THdfsCompression
> similarly, this seems unnecessarily specific to Hdfs. Not necessarily somet
Left TODO.
File fe/src/main/cup/sql-parser.cup:

> FYI I'm not looking at this very carefully because it was reviewed previous
Only few things changed mostly wrt specifying the PRIMARY KEY keyword  inside the column definition

PS1, Line 31: ColumnDefOptions
> this class doesn't exist, please add the missing file

PS1, Line 405:   view_column_def_list, view_column_defs;
> separate nonterminals if they dont fit?

PS1, Line 458: key_ident;
> does this not need Boolean like those below?
File fe/src/main/java/com/cloudera/impala/analysis/

> I'm not in love with the TableDef and TableDefOptions stuff because it feel
As we spoke offline, the initial attempt to refactor this logic didn't succeed. I will not
change these classes but I'll make an effort to separate the Kudu-specific logic at the function

PS1, Line 240: getColumnDefs().add(KuduCatalogOpExecutor.REQUIRED_THRIFT_COLUMN_DEF);
> is this needed? I thought there was going to be code that fetches the colum
We do, but that happens in the catalog. This was added to satisfy the thrift spec.

PS1, Line 242:       if (!getTblProperties().containsKey(KuduTable.KEY_TABLE_NAME)) {
             :         getTblProperties().put(KuduTable.KEY_TABLE_NAME,
             :             KuduUtil.getDefaultCreateKuduTableName(getDb(), getTbl()));
             :       }
> hm... can we simplify this and only allow internal tables to be created wit
This is executed only for managed tables. Is this what you meant? Also, I changed the prefix
to be __IMPALA__db.table;

PS1, Line 275:         // Kudu's data distribution rules are enforced here. A good reference
             :         // As
a summary:
             :         // * Only primary key column can be used in distribution definitions
             :         // * Distributions may be either hash or ranged (value intervals)
             :         // * Multiple distributions can be used/mixed
             :         //   * A column can only be used in one hash distribution
             :         //   * A range distribution can only be used once and must be the last
             :         //     (both enforced by the grammar).
> I don't think we should get in the business of validating the Kudu rules si
Actually, I don't find it such a terrible idea. The sooner we figure out invalid params the
better it is for the entire system, i.e. we don't have to make an rpc to the catalog to make
an rpc to Kudu in order to figure out that a non-primary key column was used in a distribution.

PS1, Line 323: throwIfPrimaryKeysWereSpecified
> I think the callers can throw, simpler to have a fn to check: bool hasPrima
I responded below. If you don't like it we can change to check + throw in the caller. I don't
mind either options.

PS1, Line 327: isPrimaryKey
> when will we have PKs from the col defs but not in the tableDef_.getPrimary
We can have PKs from the col defs in the following case:
CREATE TABLE foo (a int PRIMARY KEY)... Even though the callers of this function correspond
to different use cases (external vs non-kudu) the check is similar, i.e. that no PK was specified
anywhere in the CREATE TABLE stmt. I am ok with this but if you don't like it I can change

PS1, Line 327:         if (colDef.isPrimaryKey()) {
             :           hasKey = true;
             :         }
> 1line

PS1, Line 335: throwIfNotPrimaryKeyType
> I only see this called once, please inline it
File fe/src/main/java/com/cloudera/impala/analysis/

PS1, Line 84: BigDecimal
> why BigDecimal? Ultimately this has to resolve to some int for kudu's API. 
I believe because that's what the Parser generates. I changed it to int.

PS1, Line 110: <= 1)
> is 1 bucket actually not valid?
Yes, I verified it using the Kudu client.

PS1, Line 138: colType.isStringType() && !exprType.isStringType()
             :               || colType.isIntegerType() && (!exprType.isIntegerType()
             :                   || exprType.getPrecision() > colType.getPrecision())
> 1. I don't see anything in the Kudu client that explicitly says you can't p
There is a restriction on the types of primary keys (no bool, float or double) which I verified
using the Kudu client; it is also mentioned in Kudu docs (see
Consequently, you can't partition on these types of columns.

In any case, these checks here are for ensuring that the types of split value literals can
be mapped to the types of the partition columns. Do you think we should change the way we
validate this?
File fe/src/main/java/com/cloudera/impala/analysis/

PS1, Line 32: Represents the end of a CREATE TABLE statement. TableDef represents the beginning
            :  * the statement.
> I find this odd. The 'beginning/end of the statement' feels like a parser i
I merged these two classes into TableDef and updated the parser. Let me know what you think.
File fe/src/main/java/com/cloudera/impala/analysis/

PS1, Line 207: HdfsFileFormat.KUDU
> blegh this looks like an oxymoron. I wonder if we can rename this class to 
Yeah, I agree. I left a TODO in the thrift file to change it.

PS1, Line 212: keyCols != null
> why would this be null?

PS1, Line 217: kuduTableName != null
> why would this be null?

PS1, Line 217:       if (kuduTableName != null
             :           && kuduTableName.equals(KuduUtil.getDefaultCreateKuduTableName(
             :               table.getDb().getName(), table.getName()))) {
             :         properties.remove(KuduTable.KEY_TABLE_NAME);
             :       }
> might just be cleaner to leave this in and not have to have 2 cases to test

PS1, Line 223: Preconditions.checkNotNull(kuduDistributeByParams);
> This breaks for tables that were created before this change.
Good point. Removed.
File fe/src/main/java/com/cloudera/impala/catalog/

PS1, Line 39: HdfsFileFormat
> Can you open a JIRA (and leave it in the comment) to refactor this later?
File fe/src/main/java/com/cloudera/impala/catalog/

PS1, Line 222: 
> was this moved somewhere else? Maybe I'll find it as I keep reading...
Added necessary checks in the load() function.

PS1, Line 79: 
            :   // TODO we should have something like KuduConfig.getDefaultConfig()
> do you know what this means / can it be removed now that we're adding defau
Not clear to me. I removed it.

PS1, Line 87: 
            :   // Key to specify the number of tablet replicas.
            :   // TODO(KUDU): Allow modification in alter table.
            :   public static final String KEY_TABLET_REPLICAS = "kudu.num_tablet_replicas";
> After looking a bit more, I see that this is the way the user specifies the
Yeah, as you pointed out, it would be weird from a usability point of view to strip something
that the user specified in tblProperties. I removed stuff that should be in tblProperties
such as the primary key columns. Also removed the TODO. Wrt CREATE TABLE LIKE semantics, let's
talk offline.

PS1, Line 87: 
            :   // Key to specify the number of tablet replicas.
            :   // TODO(KUDU): Allow modification in alter table.
            :   public static final String KEY_TABLET_REPLICAS = "kudu.num_tablet_replicas";
> I don't think we need to store this, even if we wanted to modify this later

Line 92:   public static final String KEY_DISTRIBUTE_BY = "kudu.distribute_by";
> comment

PS1, Line 92:   public static final String KEY_DISTRIBUTE_BY = "kudu.distribute_by";
> I'm not sure we should be storing this. The Kudu client exposes KuduTable.g
Agree. I wasn't aware of the getPartitionSchema() API. I changed the code to use that instead.

Line 153:    * Load the columns from the schema list
> can you add a comment about error handling in this function?

PS1, Line 158:       LOG.error(String.format("Kudu tables must have at least one"
             :           + "key column (had %d), and no more key columns than there are table
columns "
             :           + "(had %d).", keyColumns.size(), schema.size()));
> shouldn't this still fail? if not, can you add a comment why this continues

PS1, Line 184:       LOG.error(String.format("Some key columns were not found in"
             :               + " the set of columns. List of column names: %s, List of key
column names:"
             :               + " %s", Iterables.toString(columnNames), Iterables.toString(keyColumns)));
> why do we continue?

PS1, Line 199:     // Get the table metadata from Kudu
             :     if (reuseMetadata) {
> I'm confused about this. It's not clear to me from the name 'reuseMetadata'

PS1, Line 238: // Update the HMS
             :     if (reuseMetadata) {
             :       try {
             :         client.alter_table(msTable_.getDbName(), msTable_.getTableName(), msTable_);
             :       } catch (TException e) {
             :         throw new TableLoadingException(e.getMessage());
             :       }
             :     }
> Same deal, I'm not sure why reuseMetadata means update the HMS.
File fe/src/main/java/com/cloudera/impala/service/

PS1, Line 220: 
             :   private final static KuduCatalogOpExecutor kuduExecutor_
             :       = new KuduCatalogOpExecutor();
> all static methods, why instantiate it?
Correct. Removed.

PS1, Line 1122:  // The db == null case isn't handled. The only tables this should matter
for are
              :     // Kudu tables. The expectation is Impala will always know about any Kudu
              :     // because Hive doesn't support Kudu yet. 
> I've always had a problem with this comment :/ even after trimming it down 

PS1, Line 1124: When Hive supports Kudu the DDL delegates
              :     // can be removed. tracks
the removal.
> Can you remove these sentences and just leave IMPALA-3424 as a reference? W

PS1, Line 1127: ,
> nit: extra comma

PS1, Line 1150:         // The operation will be aborted if the Kudu table cannot be dropped.
If for
              :         // some reason Kudu is permanently stuck in a non-functional state,
the user is
              :         // expected to ALTER TABLE to either set the table to UNMANAGED or
set the format
              :         // to something else.
> JIRA? Even if we don't fix it we should have something public to point to r
Not sure what the JIRA should say in this case. I'll look into the tests and see if I can
create a more specific scenario which can be documented.

PS1, Line 1205: kuduExecutor_
> static ref

PS1, Line 1406: if (KuduTable.isKuduTable(tbl)) {
              :       return createKuduTable(tbl, params.if_not_exists, params.getDistribute_by(),
              :           response);
              :     }
              :     return createTable(tbl, params.if_not_exists, params.getCache_op(), response);
              :   }
> it's too bad we have to just branch like this and handle everything differe

PS1, Line 1420:  createMetaStoreTable
> I'd prefer to have this on the prev line and the parameter on the new line

Line 1475:    * creation of a managed Kudu table.
> comment on response param

PS1, Line 1480: TDdlExecResponse response)
> I think this just fits on the prev line

PS1, Line 1483: kuduExecutor_
> static reference KuduCatalogOpExecutor, here elsewhere in this fn

PS1, Line 1509:       // Add the table to the catalog cache
              :       Table newTbl = catalog_.addTable(newTable.getDbName(), newTable.getTableName());
              :       addTableToCatalogUpdate(newTbl, response.result);
> While I don't think these will throw, it might be worth wrapping all the lo
Let me think about this a bit more. In theory, in this case the user should be able to recover
by doing a REFRESH on the table.
File fe/src/main/java/com/cloudera/impala/service/

PS1, Line 230: impaladCatalog_.getDefaultKuduMasterAddrs()
> this is fine but a little weird, normally I'm all about removing extra stat
File fe/src/main/java/com/cloudera/impala/service/

PS1, Line 83: 
            :       int otherLogLevel, boolean allowAuthToLocal, String kerberosPrincipal)
> wrapping
File fe/src/main/java/com/cloudera/impala/service/

PS1, Line 29: import org.apache.hadoop.hive.metastore.api.T
> since this file references our Table class we should import ours and refere

PS1, Line 45: yet
> remove

PS1, Line 45: When Hive
            :  * functionality is available, that should be preferred to the functionality
            :  * tracks this.
> remove text since we might not. JIRA can stay but is actually a different i

PS1, Line 65: com.cloudera.impala.catalog.Table.
> shouldn't need to ref the full namespace

Line 149:    * Reads the schema from a Kudu table and populates 'msTbl' with an equivalent
> If an error occurs we may have partially modified the table and leave it in

PS1, Line 150: if unable to do so.
> if any errors are encountered.

PS1, Line 155:       // Schemas for external tables are not specified by the user in DDL.
Instead the
             :       // user provides a Kudu table name (or uses implicit Hive Metastore to
Kudu mapping
             :       // rules), then the schema is imported from Kudu.
> I'm not sure if this is adding much and I think it's a bit confusing/out of

PS1, Line 162: // Start searching....
> remove

PS1, Line 166: if (!Strings.isNullOrEmpty(dbName) && !dbName.equalsIgnoreCase("default"))
             :           // If it exists, the fully qualified name is preferred.
             :           candidateTableNames.add(dbName + "." + tableName);
             :         }
> i don't think we should bother handling the default case specially. Can we 
File fe/src/main/java/com/cloudera/impala/util/

PS1, Line 30:  * This class wraps an org.apache.kudu.client.KuduClient to transform exceptions
            :  * ImpalaRuntimeExceptions. No additional functionality is provided. See the
            :  * documentation for information about the methods.
> Let's think about whether or not we can avoid this. Obviously this will hav
Yeah, not a big fan of this either. Let's talk about it.
File fe/src/main/java/com/cloudera/impala/util/

PS1, Line 41: org.apache.kudu.
> reference the ext Type and import ours

PS1, Line 191: isSupportedKeyType
> is this their restriction?
yes. Done

PS1, Line 191: com.cloudera.impala.catalog.
> import?

PS1, Line 196:  and the user did
             :    * not provide an explicit Kudu table name
> ... a table, assuming a custom name was not provided.

PS1, Line 201: Catalog.isDefaultDb(metastoreDbName) ?
> you know i'm voting for not special casing this...
I've changed the default behavior for external tables. For managed it may make sense to generate
a default name. No strong opinion though. If you feel it's too confusing, we can change it.
File fe/src/test/java/com/cloudera/impala/analysis/

PS1, Line 28: import com.cloudera.impala.catalog.KuduTable;
            : import junit.framework.Assert;
> nit: organize imports
File infra/python/deps/requirements.txt:

PS1, Line 67: # kudu-python==0.2.0
> When this changes I think there may be some steps involved in "upgrading" t
I didn't have to do anything. bin/ checks the versions and downloads/builds
the new version automatically.

PS1, Line 70:   # These should eventually be removed
            :   unittest2 == 1.1.0
            :     linecache2 == 1.0.0
            :     traceback2 == 1.4.0
> remove
File testdata/bin/

PS1, Line 196: primay
> primary
File testdata/workloads/functional-query/queries/QueryTest/create_kudu.test:

> Also, I don't see any tests for CTAS here or anywhere.
Most of these have been moved elsewhere. I'll add CTAS tests.

> we might still need to have explicit create table tests.
I agree. We do lack integration tests with Kudu. I'll add them in a followup patch.
File testdata/workloads/functional-query/queries/QueryTest/kudu-show-create.test:

> well, we should have some coverage of show create table...
We do in the (TestShowCreateTable class)
File testdata/workloads/functional-query/queries/QueryTest/kudu_partition_ddl.test:

PS1, Line 15: 
> I think this kind of error should still be covered. You can still mismatch 
Covered in analysis tests.

PS1, Line 37: 
> we should still cover this as well.
Covered in analysis tests.

PS1, Line 50: 
> same
Covered in analysis tests.

> Some more questions about losing coverage. Fine if this is now covered by F
Yes, all these cases are covered in the (analysis tests).
File tests/common/

> does this need to be its own file?
Maybe a more python-savvy person can argue if this is best practice. I assume
 is where you put any initialization code/global vars.
File tests/common/

PS1, Line 54:   @classmethod
            :   def get_db_name(cls):
            :     # When py.test runs with the xdist plugin, several processes are started
and each
            :     # process runs some partition of the tests. It's possible that multiple
            :     # will call this method. A random value is generated so the processes won't
            :     # to use the same database at the same time. The value is cached so within
a single
            :     # process the same database name is always used for the class. This doesn't
need to
            :     # be thread-safe since multi-threading is never used.
            :     if not cls.__DB_NAME:
            :       cls.__DB_NAME = \
            :           choice(ascii_lowercase) + "".join(sample(ascii_lowercase + digits,
            :     return cls.__DB_NAME
> I've always disliked this function... I think we should try to use the db f
Duno. I'll check with Michael.

PS1, Line 127:     mapping = {BOOL: "BOOLEAN",
             :         DOUBLE: "DOUBLE",
             :         FLOAT: "FLOAT",
             :         INT16: "SMALLINT",
             :         INT32: "INT",
             :         INT64: "BIGINT",
             :         INT8: "TINYINT",
             :         STRING: "STRING"}
> this gets defined as a local every function invocation, right? how about cr
Hm, well it's only defined once and it is used only in the context of this fn. If you insist
I'll move it outside of this fn.
File tests/custom_cluster/

PS1, Line 25: TestRedaction
> ?
:) Done

PS1, Line 42: temp_kudu_table
> I'd think we could have the test fn take the database fixture and then pass
I'll check with Michael to see what the issue with get _db_name() and if it is ok to replace
it with db fixture, I'll change it here.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Matthew Jacobs <>
Gerrit-HasComments: Yes

View raw message