impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcel Kornacker (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3742: Partitions and sort INSERTs for Kudu tables
Date Sun, 23 Apr 2017 20:29:27 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3742: Partitions and sort INSERTs for Kudu tables
......................................................................


Patch Set 4:

(24 comments)

only looked at thrift and fe changes so far. let's discuss this in person/over hangout tomorrow.

http://gerrit.cloudera.org:8080/#/c/6559/4/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

Line 59:   3: optional Exprs.TExpr partition_expr
how is this related to TDataPartition.partition_exprs?

also, the fact that it's kudu-specific should be reflected in the name.

another naming issue: this returns an index, which isn't the case for the normal partition
exprs. would be good to reflect that in the name, it's a bit confusing to read right now.


http://gerrit.cloudera.org:8080/#/c/6559/4/common/thrift/Exprs.thrift
File common/thrift/Exprs.thrift:

Line 128: struct TKuduPartitionExpr {
this feels redundant


Line 129:   // The Kudu table to use the partitioning scheme from.
how does TTableSink.target_table_id not capture that?


Line 134:   2: required list<Types.TPrimitiveType> types
shouldn't this be part of the descriptor of the target table? ie, shouldn't the kudu sink
get that from elsewhere? same for the other fields in this struct.


Line 138:   3: required list<i32> referenced_columns
what about TKuduTableSink.referenced_columns?


http://gerrit.cloudera.org:8080/#/c/6559/4/common/thrift/Partitions.thrift
File common/thrift/Partitions.thrift:

Line 40:   // and then this can be removed.
could you file a jira for this? the reason this may not just be a cosmetic issue is that by
exposing the actual sequence of partitioning parameters, we can then optimize things like
grouping aggregation and joins that often require repartitioning (by saving that repartitioning
step).


http://gerrit.cloudera.org:8080/#/c/6559/4/fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
File fe/src/main/java/org/apache/impala/analysis/InsertStmt.java:

Line 109:   // Set in analyze(). Exprs corresponding to the partitionKeyValues, if specified,
or to
correspond


Line 114:   // table, eg. partitionKeyExprs_[i] = table_.columns(partitionKeyIdx_[i])
the equality is wrong. partitionkeyexprs_ are the exprs from the select list, not columns.


Line 115:   private List<Integer> partitionKeyIdxs_ = Lists.newArrayList();
why wouldn't this just be 1, 2, 3, ...? partitionkeyexprs is already in table order, not in
select list order.


Line 699:                && partitionKeyExprs_.size() == kuduPartitionByColumnNames.size()));
odd indentation


http://gerrit.cloudera.org:8080/#/c/6559/4/fe/src/main/java/org/apache/impala/analysis/KuduPartitionExpr.java
File fe/src/main/java/org/apache/impala/analysis/KuduPartitionExpr.java:

Line 36:  * Internal expr that calls into the Kudu client to determine the partition index
for
does this really need to exist in the fe, or is this an execution-/be-only construct? it seems
like the metadata is already duplicated elsewhere, at least in the thrift structure, and you
could construct the corresponding be structure with just that.


Line 90:   protected String toSqlImpl() { return "KuduPartition()"; }
this feels wrong. will this ever get called?


http://gerrit.cloudera.org:8080/#/c/6559/4/fe/src/main/java/org/apache/impala/catalog/KuduTable.java
File fe/src/main/java/org/apache/impala/catalog/KuduTable.java:

Line 154:   public Set<String> getPartitionByColumnNames() {
call these partition column names? we also use 'partition exprs' and not 'partition-by exprs'.


http://gerrit.cloudera.org:8080/#/c/6559/4/fe/src/main/java/org/apache/impala/planner/DataPartition.java
File fe/src/main/java/org/apache/impala/planner/DataPartition.java:

Line 45:   // For hash partition: exprs used to compute hash value.
pointless changes, there's no style requirement for this.


http://gerrit.cloudera.org:8080/#/c/6559/4/fe/src/main/java/org/apache/impala/planner/DataStreamSink.java
File fe/src/main/java/org/apache/impala/planner/DataStreamSink.java:

Line 35:   private Expr partitionExpr_;
same naming comment as for TDataStreamSink


http://gerrit.cloudera.org:8080/#/c/6559/4/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java:

Line 247:     Expr partitionExpr = null;
partitionexpr and partitionexprs sounds too similar, and they are distinct concepts.


http://gerrit.cloudera.org:8080/#/c/6559/4/fe/src/main/java/org/apache/impala/planner/PlanFragment.java
File fe/src/main/java/org/apache/impala/planner/PlanFragment.java:

Line 76: public class PlanFragment extends TreeNode<PlanFragment> {
use blank lines where appropriate


Line 99:   private Expr partitionExpr_;
this seems like an operational parameter of the partition (=dataPartition_) itself. why wouldn't
it be part of that? DataPartition stores partitionExprs_, which are an operational parameter
of hash partitions.

how is this related to DSS.partitionExpr_?


Line 107:   private Expr outputPartitionExpr_;
same here


http://gerrit.cloudera.org:8080/#/c/6559/4/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

Line 541:         // Only sort for Kudu if we have a partition expr sp that we can sort the
so

when do we ever not have a partition expr?


Line 543:         orderingExprs.add(inputFragment.getPartitionExpr().clone());
why clone?


http://gerrit.cloudera.org:8080/#/c/6559/4/testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
File testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test:

Line 8: 00:SCAN HDFS [functional.alltypes]
why does sorting not make sense here?


Line 80: # upsert with inline view
an interesting new test case would be an upsert/insert with data supplied by another kudu
table that's also partitioned by bigint_col (and the aggregation would be on that bigint col).
in that case, we would be able to save ourselves two repartitioning steps. i understand that's
not the case at the moment.


Line 131: 01:EXCHANGE [KUDU(functional_kudu.testtbl.id)]
that's unfortunate. can't you very easily detect compatible partitions? if too much work,
file jira and leave todo somewhere.


-- 
To view, visit http://gerrit.cloudera.org:8080/6559
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I84ce0032a1b10958fdf31faef225372c5c38fdc4
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message