impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-5976: Remove equivalent class computation in FE
Date Wed, 18 Oct 2017 23:52:56 GMT
Alex Behm has posted comments on this change. ( )

Change subject: IMPALA-5976: Remove equivalent class computation in FE

Patch Set 1:


Very nice.

First wave of comments. I have not exhaustively gone through everything yet since this is
a very substantial patch.
Commit Message:
PS1, Line 7: IMPALA-5976: Remove equivalent class computation in FE
Remove equivalence classes from FE
File fe/src/main/java/org/apache/impala/analysis/
PS1, Line 2017: 
Very satisfying to see this code deleted!
PS1, Line 2890: 
This function was nice for debugging. Does your patch add a similar function that nicely prints
the two graphs?
PS1, Line 1253:     return !tids.isEmpty() && (tids.size() > 1 || isOjConjunct(e)
|| isFullOuterJoined(e)
This new formatting is extremely hard to read, please reformat for readability. The old formatting
was much better.
PS1, Line 1496:         for (SlotId dst : getEquivSlots(srcSid)) {
Unless we are absolutely convinced that the change is correct, I think we should try to preserve
the semantics of the original code. My understanding is that the new getEquivSlots() returns
slots that have a mutual value transfer with srcSid - which is different than what the original
code did.
PS1, Line 1608:    * For each equivalence class, adds/removes predicates from conjuncts such
that it
Refers to "equivalence class" again which is now an undefined term. If you want to use that
term we should define what it means. 

Let's think about how to evolve the terminology. Based on my understanding of your patch it
looks like you want to redefine "equivalence class" to be a set of slots where each pair of
slots has a mutual value transfer.
PS1, Line 1626:     // Equivalence classes only containing slots belonging to lhsTids.
What does this mean? Is the Integer key the SCC id?
PS1, Line 1802:         if (slotDesc.getId().asInt() >= g.numVertices()) continue;
I've seen this check in a couple of places, would be nice if that could be handled by 'g'
instead of all callsites.
PS1, Line 1952:   public void computeValueTransfer() {
PS1, Line 1972:   private void constructValueTransferFromEqPredicates(Graph g) {
valueTransfers (plural)
PS1, Line 2047:    * Returns the ids of slots that are in the same equivalence class as slotId
Comment need updating. The concept of "equivalence class" does not exist anymore and we should
not use that term.
PS1, Line 2055:     for (int dst: g.getSccMembers(g.getSccId(slotId.asInt()))) {
In terms of the g API I think callers should have to care about the existence of SCCs. Can
we hide those details?
PS1, Line 2070:     Collections.sort(result);
Shouldn't the destinations already be sorted?
File fe/src/main/java/org/apache/impala/analysis/
PS1, Line 702:     if (!localCmp.apply(Pair.create(this, that))) return false;
It's very unfortunate that we have to create a new pair object to do a SlotRef comparison.
We should try to avoid that generate an excessive number of objects. Expr.equals() really
is called quite frequently and SlotRef is the most common Expr.

Using a Java Comparator is not ideal because we don't provide ordering. Using a Function is
not ideal because of object generation. Since this is all our own code we could invent a new
more suitable interface that accepts two Exprs or even two SlotRefs directly.
File fe/src/main/java/org/apache/impala/util/
PS1, Line 1: package org.apache.impala.util;
Apache header
PS1, Line 18:   // Adjacency list of the graph (src -> [dst]). Its size is numVertices().
Comment on the ordering of elements (they are unordered correct?)
PS1, Line 19:   private final ArrayList<ArrayList<Integer>> adjList_;
I feel like this should really be int[][] for maximum efficiency. It generates fewer objects,
consumes less memory, is probably slightly faster, and only slightly more difficult to implement.

The numVertices are known and we can manually reallocate/copy when adding elements to the
second dimension. An ArrayList is basically a Object[] so it has object overhead, an additional
level of indirection, and some overhead for get() versus [] for arrays. Yes, a lot of that
is optimized away by the JIT, but it might really matter in this case.
PS1, Line 23:     for (int i = 0; i < numVertices; i ++) adjList_.add(new ArrayList<Integer>());
style: ++i
PS1, Line 36:     Preconditions.checkState(src < numVertices());
Not sure how useful these Preconditions checks are. We'll even get a more meaningful exception
if we remove them.
PS1, Line 45:   public List<Integer> dests(int src) {
Does this really need to be public?
PS1, Line 46:     Preconditions.checkState(src >= 0 && src < numVertices());
Don't think this is needed.
PS1, Line 73:   private void reachabilityDfs(int vertex, BitSet visited) {
I'm worried this might blow the stack for huge graphs - this would be a regression from previous
behavior (query that worked before would now fail to generate a plan). Seems safer to implement
the DFS without recursion.
PS1, Line 86:   public Graph getTransitiveClosure() {
Any idea how this compares to the previous transitive closure implementation in terms of speed?
I tend to prefer this new version because it is simpler, but I am worried it might be slower.
PS1, Line 89:       BitSet visited = new BitSet(numVertices());
Create outside of loop and clear() within loop to avoid generating objects / allocating bitset
PS1, Line 107:     // Map an original vertex id to its belonging SCC id.
remove "belonging" the "its" already implies that
PS1, Line 109:     // Map an SCC id to its members (a list of vertex ids in original graph).
(a list of original vertex ids)
PS1, Line 110:     private final ArrayList<ArrayList<Integer>> sccMemberList_;
Can this be an int[][] for the same reasons stated above?

Also "sccMembers_" seems sufficient
PS1, Line 114:     private final ArrayList<HashSet<Integer>> transitiveClosureRA_;
Does the HashSet really buy us much versus doing a binary search on the sorted vertex ids?
The HashSet does come with extra space/object overhead.

Do we ever iterate over this HashSet? We definitely need to avoid that due to getting back
elements in a "random" order that changes across JVM versions.

If you convert the adjList to int[][] then you could use Arrays.binarySearch()
PS1, Line 134:       Preconditions.checkState(src >= 0 && src < numVertices());
Remove these Preconditions checks - they don't seem useful to me.
PS1, Line 144:       Preconditions.checkState(src >= 0 && src < numVertices());
PS1, Line 156:     public int getSccId(int vertex) {
PS1, Line 180:       Graph transitiveClosure = projectGraphOnScc(g, scc.first).getTransitiveClosure();
Please avoid chaining multiple non-trivial function calls because that makes them easy for
readers to miss. I think one step is left out of the comments:

Step 0: SCC
Step 1: Project graph into SCC space
Step 2: Compute transitive closure on projected graph
PS1, Line 195:      * Compute the strongly connected components.
... using Tarjan's algorithm.

It's a really neat algorithm that deserves a brief description here :)
PS1, Line 201:       int[] sccId = new int[g.numVertices()];
PS1, Line 204:       int[] index = new int[g.numVertices()];
Needs a better variable name. Not clear what "index" means; "indexes" is better but still
does not convey what is indexed into. Are these really dfsIndexes? I feel like the concept
of this "dfs index" needs to be explained.
PS1, Line 205:       int[] lowLink = new int[g.numVertices()];
PS1, Line 209:       int indexCounter = 0;
Needs a better var name, what is this counting? Is this something like visitedOrdinal or dfsOrdinal?
PS1, Line 223:      * @param vertex The vertex currently being searched.
I think we should use vertexId or vid or similar consistently.
PS1, Line 227:      * @param lowLink The mapping from a vertex id ('v') to the lowest DFS
index of
Brief description should also be above L205
PS1, Line 237:         BitSet onStack, int sccId[], ArrayList<ArrayList<Integer>>
sccMemberList) {
int[] sccId
PS1, Line 239:       index[vertex] = lowLink[vertex] = indexCounter ++;
index[vertex] = lowLink[vertex] = indexCounter;
PS1, Line 243:         if (index[dst] == -1) {
A few simple comments here and there would make this code easier to follow. For example, there
are 3 cases here, would be nice to describe them like this:

Case 1: dstVid has not been visited...
Case 2: dstVid has been visited and is on the stack ...
Case 3: dstVid has been visited and is not on the stack ...
PS1, Line 244:           indexCounter = tarjanSccDfs(dst, g, indexCounter, index, lowLink,
I'm worried that recursion might blow the stack and lead to a regression in behavior (query
that used to work now fails to plan).
PS1, Line 252:         ArrayList<Integer> sccMembers = new ArrayList<>();
Brief comment about this case/code, e.g. "Create an SCC from all elements on the stack up
to 'vertex'.
PS1, Line 253:         for (int stackTop = -1; stackTop != vertex; ) {
use while()
PS1, Line 265:      * Project the original graph represented by 'adjList' to a new graph on
SCC space.
Project the original graph 'g' to a new graph in SCC space.
PS1, Line 274:         for (int dst : g.dests(src)) {
Instead of having a separate normalize(), can't you remove duplicates right here by keeping
track of the added sccs in a Bitset?

To view, visit
To unsubscribe, visit

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If4cb1d8be46efa8fd61a97048cc79dabe2ffa51a
Gerrit-Change-Number: 8317
Gerrit-PatchSet: 1
Gerrit-Owner: Tianyi Wang <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Comment-Date: Wed, 18 Oct 2017 23:52:56 +0000
Gerrit-HasComments: Yes

  • Unnamed multipart/alternative (inline, 8-Bit, 0 bytes)
View raw message