impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5976: Remove equivalent class computation in FE
Date Wed, 18 Oct 2017 23:52:56 GMT
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8317 )

Change subject: IMPALA-5976: Remove equivalent class computation in FE
......................................................................


Patch Set 1:

(47 comments)

Very nice.

First wave of comments. I have not exhaustively gone through everything yet since this is
a very substantial patch.

http://gerrit.cloudera.org:8080/#/c/8317/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8317/1//COMMIT_MSG@7
PS1, Line 7: IMPALA-5976: Remove equivalent class computation in FE
Remove equivalence classes from FE


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
File fe/src/main/java/org/apache/impala/analysis/Analyzer.java:

http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@a2017
PS1, Line 2017: 
Very satisfying to see this code deleted!


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@a2890
PS1, Line 2890: 
This function was nice for debugging. Does your patch add a similar function that nicely prints
the two graphs?


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1253
PS1, Line 1253:     return !tids.isEmpty() && (tids.size() > 1 || isOjConjunct(e)
|| isFullOuterJoined(e)
This new formatting is extremely hard to read, please reformat for readability. The old formatting
was much better.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1496
PS1, Line 1496:         for (SlotId dst : getEquivSlots(srcSid)) {
Unless we are absolutely convinced that the change is correct, I think we should try to preserve
the semantics of the original code. My understanding is that the new getEquivSlots() returns
slots that have a mutual value transfer with srcSid - which is different than what the original
code did.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1608
PS1, Line 1608:    * For each equivalence class, adds/removes predicates from conjuncts such
that it
Refers to "equivalence class" again which is now an undefined term. If you want to use that
term we should define what it means. 

Let's think about how to evolve the terminology. Based on my understanding of your patch it
looks like you want to redefine "equivalence class" to be a set of slots where each pair of
slots has a mutual value transfer.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1626
PS1, Line 1626:     // Equivalence classes only containing slots belonging to lhsTids.
What does this mean? Is the Integer key the SCC id?


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1802
PS1, Line 1802:         if (slotDesc.getId().asInt() >= g.numVertices()) continue;
I've seen this check in a couple of places, would be nice if that could be handled by 'g'
instead of all callsites.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1952
PS1, Line 1952:   public void computeValueTransfer() {
computeValueTransferGraph


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1972
PS1, Line 1972:   private void constructValueTransferFromEqPredicates(Graph g) {
valueTransfers (plural)


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2047
PS1, Line 2047:    * Returns the ids of slots that are in the same equivalence class as slotId
Comment need updating. The concept of "equivalence class" does not exist anymore and we should
not use that term.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2055
PS1, Line 2055:     for (int dst: g.getSccMembers(g.getSccId(slotId.asInt()))) {
In terms of the g API I think callers should have to care about the existence of SCCs. Can
we hide those details?


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2070
PS1, Line 2070:     Collections.sort(result);
Shouldn't the destinations already be sorted?


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Expr.java
File fe/src/main/java/org/apache/impala/analysis/Expr.java:

http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Expr.java@702
PS1, Line 702:     if (!localCmp.apply(Pair.create(this, that))) return false;
It's very unfortunate that we have to create a new pair object to do a SlotRef comparison.
We should try to avoid that generate an excessive number of objects. Expr.equals() really
is called quite frequently and SlotRef is the most common Expr.

Using a Java Comparator is not ideal because we don't provide ordering. Using a Function is
not ideal because of object generation. Since this is all our own code we could invent a new
more suitable interface that accepts two Exprs or even two SlotRefs directly.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java
File fe/src/main/java/org/apache/impala/util/Graph.java:

http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@1
PS1, Line 1: package org.apache.impala.util;
Apache header


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@18
PS1, Line 18:   // Adjacency list of the graph (src -> [dst]). Its size is numVertices().
Comment on the ordering of elements (they are unordered correct?)


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@19
PS1, Line 19:   private final ArrayList<ArrayList<Integer>> adjList_;
I feel like this should really be int[][] for maximum efficiency. It generates fewer objects,
consumes less memory, is probably slightly faster, and only slightly more difficult to implement.

The numVertices are known and we can manually reallocate/copy when adding elements to the
second dimension. An ArrayList is basically a Object[] so it has object overhead, an additional
level of indirection, and some overhead for get() versus [] for arrays. Yes, a lot of that
is optimized away by the JIT, but it might really matter in this case.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@23
PS1, Line 23:     for (int i = 0; i < numVertices; i ++) adjList_.add(new ArrayList<Integer>());
style: ++i


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@36
PS1, Line 36:     Preconditions.checkState(src < numVertices());
Not sure how useful these Preconditions checks are. We'll even get a more meaningful exception
if we remove them.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@45
PS1, Line 45:   public List<Integer> dests(int src) {
Does this really need to be public?


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@46
PS1, Line 46:     Preconditions.checkState(src >= 0 && src < numVertices());
Don't think this is needed.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@73
PS1, Line 73:   private void reachabilityDfs(int vertex, BitSet visited) {
I'm worried this might blow the stack for huge graphs - this would be a regression from previous
behavior (query that worked before would now fail to generate a plan). Seems safer to implement
the DFS without recursion.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@86
PS1, Line 86:   public Graph getTransitiveClosure() {
Any idea how this compares to the previous transitive closure implementation in terms of speed?
I tend to prefer this new version because it is simpler, but I am worried it might be slower.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@89
PS1, Line 89:       BitSet visited = new BitSet(numVertices());
Create outside of loop and clear() within loop to avoid generating objects / allocating bitset
buffer.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@107
PS1, Line 107:     // Map an original vertex id to its belonging SCC id.
remove "belonging" the "its" already implies that


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@109
PS1, Line 109:     // Map an SCC id to its members (a list of vertex ids in original graph).
(a list of original vertex ids)


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@110
PS1, Line 110:     private final ArrayList<ArrayList<Integer>> sccMemberList_;
Can this be an int[][] for the same reasons stated above?

Also "sccMembers_" seems sufficient


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@114
PS1, Line 114:     private final ArrayList<HashSet<Integer>> transitiveClosureRA_;
Does the HashSet really buy us much versus doing a binary search on the sorted vertex ids?
The HashSet does come with extra space/object overhead.

Do we ever iterate over this HashSet? We definitely need to avoid that due to getting back
elements in a "random" order that changes across JVM versions.

If you convert the adjList to int[][] then you could use Arrays.binarySearch()


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@134
PS1, Line 134:       Preconditions.checkState(src >= 0 && src < numVertices());
Remove these Preconditions checks - they don't seem useful to me.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@144
PS1, Line 144:       Preconditions.checkState(src >= 0 && src < numVertices());
Remove


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@156
PS1, Line 156:     public int getSccId(int vertex) {
Remove


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@180
PS1, Line 180:       Graph transitiveClosure = projectGraphOnScc(g, scc.first).getTransitiveClosure();
Please avoid chaining multiple non-trivial function calls because that makes them easy for
readers to miss. I think one step is left out of the comments:

Step 0: SCC
Step 1: Project graph into SCC space
Step 2: Compute transitive closure on projected graph
...


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@195
PS1, Line 195:      * Compute the strongly connected components.
... using Tarjan's algorithm.

It's a really neat algorithm that deserves a brief description here :)


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@201
PS1, Line 201:       int[] sccId = new int[g.numVertices()];
sccIds


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@204
PS1, Line 204:       int[] index = new int[g.numVertices()];
Needs a better variable name. Not clear what "index" means; "indexes" is better but still
does not convey what is indexed into. Are these really dfsIndexes? I feel like the concept
of this "dfs index" needs to be explained.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@205
PS1, Line 205:       int[] lowLink = new int[g.numVertices()];
lowLinks


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@209
PS1, Line 209:       int indexCounter = 0;
Needs a better var name, what is this counting? Is this something like visitedOrdinal or dfsOrdinal?


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@223
PS1, Line 223:      * @param vertex The vertex currently being searched.
I think we should use vertexId or vid or similar consistently.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@227
PS1, Line 227:      * @param lowLink The mapping from a vertex id ('v') to the lowest DFS
index of
Brief description should also be above L205


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@237
PS1, Line 237:         BitSet onStack, int sccId[], ArrayList<ArrayList<Integer>>
sccMemberList) {
int[] sccId


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@239
PS1, Line 239:       index[vertex] = lowLink[vertex] = indexCounter ++;
index[vertex] = lowLink[vertex] = indexCounter;
++indexCounter;


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@243
PS1, Line 243:         if (index[dst] == -1) {
A few simple comments here and there would make this code easier to follow. For example, there
are 3 cases here, would be nice to describe them like this:

Case 1: dstVid has not been visited...
Case 2: dstVid has been visited and is on the stack ...
Case 3: dstVid has been visited and is not on the stack ...


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@244
PS1, Line 244:           indexCounter = tarjanSccDfs(dst, g, indexCounter, index, lowLink,
stack,
I'm worried that recursion might blow the stack and lead to a regression in behavior (query
that used to work now fails to plan).


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@252
PS1, Line 252:         ArrayList<Integer> sccMembers = new ArrayList<>();
Brief comment about this case/code, e.g. "Create an SCC from all elements on the stack up
to 'vertex'.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@253
PS1, Line 253:         for (int stackTop = -1; stackTop != vertex; ) {
use while()


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@265
PS1, Line 265:      * Project the original graph represented by 'adjList' to a new graph on
SCC space.
Project the original graph 'g' to a new graph in SCC space.


http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@274
PS1, Line 274:         for (int dst : g.dests(src)) {
Instead of having a separate normalize(), can't you remove duplicates right here by keeping
track of the added sccs in a Bitset?



-- 
To view, visit http://gerrit.cloudera.org:8080/8317
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If4cb1d8be46efa8fd61a97048cc79dabe2ffa51a
Gerrit-Change-Number: 8317
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <twang@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Comment-Date: Wed, 18 Oct 2017 23:52:56 +0000
Gerrit-HasComments: Yes

Mime
  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message