impala-reviews mailing list archives

From "Anonymous Coward (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Date Thu, 15 Sep 2016 01:08:09 GMT
stakiar@cloudera.com has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/4419

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions
......................................................................

IMPALA-4101: qgen: Hive join predicates should only contains equality functions

Background:

Hive supports only equi-joins in its JOIN clause, while Postgres and Impala also accept
other comparison operators such as <, <=, >, and >=. This change modifies the
QueryGenerator._create_relational_join_condition and
QueryGenerator._create_boolean_func_tree methods so that they construct only equality
join conditions when the active query profile restricts JOIN signatures, as the
HiveProfile does.

For JOIN clauses, the _create_boolean_func_tree method is reached via
QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause ->
_create_relational_join_condition -> _create_boolean_func_tree. The same method is
invoked when constructing the JOIN, WHERE, and HAVING clauses; it creates a tree of
functions that would typically be found in any of these clauses.

Changes:

A "signatures" parameter is added to _create_boolean_func_tree; it lists the function
signatures the method is allowed to use. Previously, this list was always populated by
calling _funcs_to_allowed_signatures(FUNCS); if "signatures" is not specified, the code
still falls back to the result of that call.

A new method, get_allowed_join_signatures, is introduced in the DefaultProfile and
returns the list of function signatures that are allowed within a JOIN clause. The
DefaultProfile allows all given signatures, while the HiveProfile allows only the Equals
and And functions, plus any function that operates over only one column. These
restrictions exist because Hive allows only equality joins, does not allow OR operators
in the join clause, and does not allow any other operator that operates over multiple
columns.
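
As a rough illustration of the filtering described above, the sketch below uses
simplified stand-ins rather than the real classes in tests/comparison. The Signature
tuple, its column_count field, the ALL_SIGNATURES list, and the string function names are
assumptions made for this example; only the method name get_allowed_join_signatures, the
profile names, and the "signatures" parameter come from this change.

```python
from collections import namedtuple

# Hypothetical stand-in for a function signature: the function name plus the
# number of distinct columns its arguments read. Not the real qgen type.
Signature = namedtuple("Signature", ["func", "column_count"])

# Hypothetical stand-in for the result of _funcs_to_allowed_signatures(FUNCS).
ALL_SIGNATURES = [
    Signature("Equals", 2),
    Signature("LessThan", 2),
    Signature("And", 2),
    Signature("Or", 2),
    Signature("IsNull", 1),
]


class DefaultProfile(object):
    def get_allowed_join_signatures(self, signatures):
        # The default profile places no restriction on JOIN signatures.
        return list(signatures)


class HiveProfile(DefaultProfile):
    def get_allowed_join_signatures(self, signatures):
        # Keep Equals and And, plus any function that reads a single column.
        return [s for s in signatures
                if s.func in ("Equals", "And") or s.column_count == 1]


def create_boolean_func_tree(signatures=None):
    # Mirrors the described fallback: with no explicit list, use every signature.
    # The real method builds a function tree; this sketch only returns the list
    # it would draw from.
    if signatures is None:
        signatures = ALL_SIGNATURES
    return signatures


hive_join_signatures = create_boolean_func_tree(
    HiveProfile().get_allowed_join_signatures(ALL_SIGNATURES))
```

In this sketch a caller building a WHERE or HAVING clause would omit "signatures" and
get the unrestricted list, while the JOIN path passes the profile-filtered list.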

Note that _create_boolean_func_tree can still produce OR operators because of some logic
around its "and_or_fill_ratio" variable. The plan is to fix this in a future patch that
specifically focuses on removing OR operators from Hive JOIN clauses.

Testing:

* Added a new unit test that ensures the HiveProfile only returns equality joins
* Tested against Hive locally
* Tested against Impala via Leopard
* Tested against Impala via the Discrepancy Checker inside an Impala Docker container
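
The kind of check the new unit test performs can be sketched as follows. This is not the
code in test_create_join_clause.py: the nested-tuple representation of a join condition,
the COMPARISONS set, and both helper names are hypothetical and chosen only to keep the
example self-contained.

```python
# Hypothetical set of comparison function names a generator might emit.
COMPARISONS = {
    "Equals", "NotEquals", "LessThan", "LessThanOrEquals",
    "GreaterThan", "GreaterThanOrEquals",
}


def func_names(condition):
    """Yield every function name in a nested (name, arg, ...) tuple tree.

    Leaves are plain strings (column names) and are skipped.
    """
    if isinstance(condition, tuple):
        yield condition[0]
        for arg in condition[1:]:
            yield from func_names(arg)


def is_equi_join(condition):
    # An equi-join condition may use Equals but no other comparison function.
    used_comparisons = set(func_names(condition)) & COMPARISONS
    return used_comparisons <= {"Equals"}


equi = ("And", ("Equals", "t1.a", "t2.a"), ("Equals", "t1.b", "t2.b"))
non_equi = ("And", ("Equals", "t1.a", "t2.a"), ("LessThan", "t1.b", "t2.b"))
```

A test in this style would generate many join conditions with the HiveProfile and assert
is_equi_join on each of them.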

Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
---
M tests/comparison/discrepancy_searcher.py
M tests/comparison/query_generator.py
M tests/comparison/query_profile.py
A tests/comparison/tests/hive/test_create_join_clause.py
4 files changed, 110 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stakiar@cloudera.com
