Mailing-List: contact commits-help@impala.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@impala.incubator.apache.org
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: tarmstrong@apache.org
To: commits@impala.incubator.apache.org
Date: Fri, 22 Jul 2016 18:03:37 -0000
Message-Id: <467bba84d39640efa8472376dc7cfca7@git.apache.org>
Subject: [1/3] incubator-impala git commit: IMPALA-3864: qgen: reduce
 likelihood of create_query() exceptions
archived-at: Fri, 22 Jul 2016 18:03:44 -0000

Repository: incubator-impala
Updated Branches:
  refs/heads/master e2a70388f -> e0fb432b8


IMPALA-3864: qgen: reduce likelihood of create_query() exceptions

1. Fix a bug in which the computation to produce the string for an
   exception was raising a TypeError. We fix the bug by changing how the
   string is built.

2. Fix a bug in which we tried to choose a relational function (defined
   as taking in more than one argument and returning a Boolean) and were
   looking for its weight in QueryProfile.weights.RELATIONAL_FUNCS, but
   the function wasn't defined in that dictionary. We fix the bug by
   defining weights for those functions.

3. Fix a bug in which QueryProfile.choose_func_signatures() was choosing
   a function without taking into account the set of functions in the
   signatures given to it. We fix the bug by pruning off the weights of
   functions that aren't included in provided signatures. We also add a
   note explaining how the weights defined are "best effort", since
   sometimes functions will be pruned.

4. Add Char signatures to LessThan, GreaterThan, LessThanOrEquals,
   GreaterThanOrEquals. Debugging #3 above brought this to my attention.

5. Make changes to aid in debugging or testing:

   a. Add funcs.Signature representation.
   b. Move the code in query_generator.__main__ to
      generate_queries_for_manual_inspection(); call it from __main__.
   c. Increase the number of fake columns when calling
      generate_queries_for_manual_inspection(), which is useful for
      testing.
   d. Rename a few variables. For some reason in the query_generator
      module there are a lot of overwritten variables, which makes
      debugging difficult.

Testing:

1. impala-python tests/comparison/query_generator.py produces far fewer
   exceptions. The ones for this bug are fixed, but see IMPALA-3890.

2. Full 3 and 6 hour runs of the query generator system don't show any
   obvious regressions in behavior.

Change-Id: Idd9434a92973176aefb99e11e039209cac3cea65
Reviewed-on: http://gerrit.cloudera.org:8080/3720
Tested-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Michael Brown <mikeb@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/e0fb432b
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/e0fb432b
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/e0fb432b

Branch: refs/heads/master
Commit: e0fb432b8261672ff567ec2dafe651c2a92c13ce
Parents: af8b187
Author: Michael Brown <mikeb@cloudera.com>
Authored: Thu Jul 14 17:25:17 2016 -0700
Committer: Tim Armstrong <tarmstrong@cloudera.com>
Committed: Fri Jul 22 11:03:33 2016 -0700

----------------------------------------------------------------------
 tests/comparison/funcs.py           |  7 +++++
 tests/comparison/query_generator.py | 27 ++++++++++++-----
 tests/comparison/query_profile.py   | 51 ++++++++++++++++++++++++--------
 3 files changed, 65 insertions(+), 20 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e0fb432b/tests/comparison/funcs.py
----------------------------------------------------------------------
diff --git a/tests/comparison/funcs.py b/tests/comparison/funcs.py
index c971738..eac0cf3 100644
--- a/tests/comparison/funcs.py
+++ b/tests/comparison/funcs.py
@@ -108,6 +108,11 @@ class Signature(object):
     self.return_type = return_type
     self.args = list(args)
 
+  def __repr__(self):
+    return "Signature<func: {func}, returns: {rt}, args: {arg_list}>".format(
+        func=repr(self.func), rt=repr(self.return_type),
+        arg_list=", ".join([repr(arg) for arg in self.args]))
+
   @property
   def input_types(self):
     return self.args[1:]
@@ -477,11 +482,13 @@ for func_name in ['In', 'NotIn']:
       [Boolean, Timestamp, Timestamp, Timestamp]])
 for comparator in ['GreaterThan', 'LessThan']:
   create_func(comparator, signatures=[
+      [Boolean, Char, Char],
       [Boolean, Number, Number],
       [Boolean, Timestamp, Timestamp]])
 for comparator in ['GreaterThanOrEquals', 'LessThanOrEquals']:
   # Avoid equality comparison on FLOATs
   create_func(comparator, signatures=[
+      [Boolean, Char, Char],
       [Boolean, Decimal, Decimal],
       [Boolean, Decimal, Int],
       [Boolean, Int, Decimal],

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e0fb432b/tests/comparison/query_generator.py
----------------------------------------------------------------------
diff --git a/tests/comparison/query_generator.py b/tests/comparison/query_generator.py
index 834b8de..69bb40b 100644
--- a/tests/comparison/query_generator.py
+++ b/tests/comparison/query_generator.py
@@ -1465,33 +1465,44 @@ class QueryGenerator(object):
     return root_func, relational_funcs
 
 
-if __name__ == '__main__':
+def generate_queries_for_manual_inspection():
   '''Generate some queries for manual inspection. The query won't run anywhere because the
      tables used are fake. To make real queries, we'd need to connect to a database and
      read the table metadata and such.
   '''
+  from model_translator import SqlWriter
+
+  NUM_TABLES = 5
+  NUM_COLS_EACH_TYPE = 10
+  NUM_QUERIES = 3000
+
   tables = list()
   data_types = list(TYPES)
   data_types.remove(Float)
-  for table_idx in xrange(5):
+  for table_idx in xrange(NUM_TABLES):
     table = Table('table_%s' % table_idx)
     tables.append(table)
     cols = table.cols
-    for col_idx in xrange(3):
-      col_type = choice(data_types)
-      col = Column(table, '%s_col_%s' % (col_type.__name__.lower(), col_idx), col_type)
-      cols.append(col)
+    col_idx = 0
+    for _ in xrange(NUM_COLS_EACH_TYPE):
+      for col_type in data_types:
+        col = Column(table, '%s_col_%s' % (col_type.__name__.lower(), col_idx), col_type)
+        cols.append(col)
+        col_idx += 1
     table.cols = cols
 
   query_profile = DefaultProfile()
   query_generator = QueryGenerator(query_profile)
-  from model_translator import SqlWriter
   sql_writer = SqlWriter.create(dialect='IMPALA')
   ref_writer = SqlWriter.create(dialect='POSTGRESQL',
       nulls_order_asc=query_profile.nulls_order_asc())
-  for _ in range(3000):
+  for _ in xrange(NUM_QUERIES):
     query = query_generator.create_query(tables)
     print("Test db")
     print(sql_writer.write_query(query) + '\n')
     print("Ref db")
     print(ref_writer.write_query(query) + '\n')
+
+
+if __name__ == '__main__':
+  generate_queries_for_manual_inspection()

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e0fb432b/tests/comparison/query_profile.py
----------------------------------------------------------------------
diff --git a/tests/comparison/query_profile.py b/tests/comparison/query_profile.py
index 223e34e..893188c 100644
--- a/tests/comparison/query_profile.py
+++ b/tests/comparison/query_profile.py
@@ -25,9 +25,11 @@ from db_types import (
     Timestamp)
 from funcs import (
     And,
+    Coalesce,
     Equals,
     GreaterThan,
     GreaterThanOrEquals,
+    If,
     In,
     IsDistinctFrom,
     IsNotDistinctFrom,
@@ -78,18 +80,40 @@ class DefaultProfile(object):
             Int: 10,
             Timestamp: 1},
         'RELATIONAL_FUNCS': {
+        # The weights below are "best effort" suggestions. Because QueryGenerator
+        # prefers to set column types first, and some functions are "supported" only by
+        # some types, it means functions can be pruned off from this dictionary, and
+        # that will shift the probabilities. A quick example if that if a Char column is
+        # chosen: LessThan may not have a pre-defined signature for Char comparison, so
+        # LessThan shouldn't be chosen with Char columns. The tendency to prune will
+        # shift as the "funcs" module is adjusted to add/remove signatures.
+            And: 2,
+            Coalesce: 2,
             Equals: 40,
             GreaterThan: 2,
             GreaterThanOrEquals: 2,
             In: 2,
+            If: 2,
             IsDistinctFrom: 2,
             IsNotDistinctFrom: 1,
             IsNotDistinctFromOp: 1,
             LessThan: 2,
             LessThanOrEquals: 2,
             NotEquals: 2,
-            NotIn: 2},
+            NotIn: 2,
+            Or: 2,
+         },
         'CONJUNCT_DISJUNCTS': {
+        # And and Or appear both under RELATIONAL_FUNCS and CONJUNCT_DISJUNCTS for the
+        # following reasons:
+        # 1. And and Or are considered "relational" by virtue of taking two arguments
+        # and returning a Boolean. The crude signature selection means they could be
+        # selected, so we describe weights there.
+        # 2. They are set here explicitly as well so that
+        # QueryGenerator._create_bool_func_tree() can create a "more realistic"
+        # expression that has a Boolean operator at the top of the tree by explicitly
+        # asking for an And or Or.
+        # IMPALA-3896 tracks a better way to do this.
             And: 5,
             Or: 1},
         'ANALYTIC_WINDOW': {
@@ -226,14 +250,14 @@ class DefaultProfile(object):
       lower, upper = bounds
     return randint(lower, upper)
 
-  def _choose_from_weights(self, *weights):
+  def _choose_from_weights(self, *weight_args):
     '''Returns a value that is selected from the keys of weights with the probability
        determined by the values of weights.
     '''
-    if isinstance(weights[0], str):
-      weights = self.weights(*weights)
+    if isinstance(weight_args[0], str):
+      weights = self.weights(*weight_args)
     else:
-      weights = weights[0]
+      weights = weight_args[0]
     total_weight = sum(weights.itervalues())
     numeric_choice = randint(1, total_weight)
     for choice_, weight in weights.iteritems():
@@ -430,20 +454,19 @@ class DefaultProfile(object):
     '''
     if not signatures:
       raise Exception('At least one signature is required')
-    signatures = filter(
+    filtered_signatures = filter(
         lambda s: s.return_type == Boolean \
             and len(s.args) > 1 \
             and not any(a.is_subquery for a in s.args),
         signatures)
-    if not signatures:
+    if not filtered_signatures:
       raise Exception(
           'None of the provided signatures corresponded to a relational function')
     func_weights = self.weights('RELATIONAL_FUNCS')
-    missing_funcs = set(s.func for s in signatures) - set(func_weights)
+    missing_funcs = set(s.func for s in filtered_signatures) - set(func_weights)
     if missing_funcs:
-      raise Exception("Weights are missing for functions: %s"
-          % ", ".join([missing_funcs]))
-    return self.choose_func_signature(signatures, self.weights('RELATIONAL_FUNCS'))
+      raise Exception("Weights are missing for functions: {0}".format(missing_funcs))
+    return self.choose_func_signature(filtered_signatures, self.weights('RELATIONAL_FUNCS'))
 
   def choose_func_signature(self, signatures, _func_weights=None):
     '''Return a signature chosen from "signatures".'''
@@ -453,7 +476,11 @@ class DefaultProfile(object):
     type_weights = self.weights('TYPES')
 
     func_weights = _func_weights
-    if not func_weights:
+    if func_weights:
+      distinct_funcs_in_signatures = set([s.func for s in signatures])
+      pruned_func_weights = {f: func_weights[f] for f in distinct_funcs_in_signatures}
+      func_weights = pruned_func_weights
+    else:
       # First a function will be chosen then a signature. This is done so that the number
       # of signatures a function has doesn't influence its likelihood of being chosen.
       # Functions will be weighted based on the weight of the types in their arguments.