Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 4B047200B4C for ; Fri, 22 Jul 2016 20:03:44 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 499D2160A6D; Fri, 22 Jul 2016 18:03:44 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 49BFB160A5A for ; Fri, 22 Jul 2016 20:03:43 +0200 (CEST) Received: (qmail 80706 invoked by uid 500); 22 Jul 2016 18:03:42 -0000 Mailing-List: contact commits-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@impala.incubator.apache.org Delivered-To: mailing list commits@impala.incubator.apache.org Received: (qmail 80697 invoked by uid 99); 22 Jul 2016 18:03:42 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 22 Jul 2016 18:03:42 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 18452C0D8B for ; Fri, 22 Jul 2016 18:03:42 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -4.646 X-Spam-Level: X-Spam-Status: No, score=-4.646 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, KAM_LAZY_DOMAIN_SECURITY=1, RCVD_IN_DNSWL_HI=-5, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RP_MATCHES_RCVD=-1.426] autolearn=disabled Received: from mx2-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id Xno5trmwy9HT for ; Fri, 22 Jul 2016 18:03:39 +0000 (UTC) Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx2-lw-eu.apache.org (ASF Mail Server at mx2-lw-eu.apache.org) with SMTP id 8D07B60E50 for ; Fri, 22 Jul 2016 18:03:38 +0000 (UTC) Received: (qmail 79778 invoked by uid 99); 22 Jul 2016 18:03:37 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 22 Jul 2016 18:03:37 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 624F9E058E; Fri, 22 Jul 2016 18:03:37 +0000 (UTC) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: tarmstrong@apache.org To: commits@impala.incubator.apache.org Date: Fri, 22 Jul 2016 18:03:37 -0000 Message-Id: <467bba84d39640efa8472376dc7cfca7@git.apache.org> X-Mailer: ASF-Git Admin Mailer Subject: [1/3] incubator-impala git commit: IMPALA-3864: qgen: reduce likelihood of create_query() exceptions archived-at: Fri, 22 Jul 2016 18:03:44 -0000 Repository: incubator-impala Updated Branches: refs/heads/master e2a70388f -> e0fb432b8 IMPALA-3864: qgen: reduce likelihood of create_query() exceptions 1. Fix a bug in which the computation to produce the string for an exception was raising a TypeError. We fix the bug by changing how the string is built. 2. Fix a bug in which we tried to choose a relational function (defined as taking in more than one argument and returning a Boolean) and were looking for its weight in QueryProfile.weights.RELATIONAL_FUNCS, but the function wasn't defined in that dictionary. We fix the bug by defining weights for those functions. 3. Fix a bug in which QueryProfile.choose_func_signatures() was choosing a function without taking into account the set of functions in the signatures given to it. We fix the bug by pruning off the weights of functions that aren't included in provided signatures. We also add a note explaining how the weights defined are "best effort", since sometimes functions will be pruned. 4. Add Char signatures to LessThan, GreaterThan, LessThanOrEquals, GreaterThanOrEquals. Debugging #3 above brought this to my attention. 5. Make changes to aid in debugging or testing: a. Add funcs.Signature representation. b. Move the code in query_generator.__main__ to generate_queries_for_manual_inspection(); call it from __main__. c. Increase the number of fake columns when calling generate_queries_for_manual_inspection(), which is useful for testing. d. Rename a few variables. For some reason in the query_generator module there are a lot of overwritten variables, which makes debugging difficult. Testing: 1. impala-python tests/comparison/query_generator.py produces far fewer exceptions. The ones for this bug are fixed, but see IMPALA-3890. 2. Full 3 and 6 hour runs of the query generator system don't show any obvious regressions in behavior. Change-Id: Idd9434a92973176aefb99e11e039209cac3cea65 Reviewed-on: http://gerrit.cloudera.org:8080/3720 Tested-by: Michael Brown Reviewed-by: Michael Brown Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/e0fb432b Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/e0fb432b Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/e0fb432b Branch: refs/heads/master Commit: e0fb432b8261672ff567ec2dafe651c2a92c13ce Parents: af8b187 Author: Michael Brown Authored: Thu Jul 14 17:25:17 2016 -0700 Committer: Tim Armstrong Committed: Fri Jul 22 11:03:33 2016 -0700 ---------------------------------------------------------------------- tests/comparison/funcs.py | 7 +++++ tests/comparison/query_generator.py | 27 ++++++++++++----- tests/comparison/query_profile.py | 51 ++++++++++++++++++++++++-------- 3 files changed, 65 insertions(+), 20 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e0fb432b/tests/comparison/funcs.py ---------------------------------------------------------------------- diff --git a/tests/comparison/funcs.py b/tests/comparison/funcs.py index c971738..eac0cf3 100644 --- a/tests/comparison/funcs.py +++ b/tests/comparison/funcs.py @@ -108,6 +108,11 @@ class Signature(object): self.return_type = return_type self.args = list(args) + def __repr__(self): + return "Signature".format( + func=repr(self.func), rt=repr(self.return_type), + arg_list=", ".join([repr(arg) for arg in self.args])) + @property def input_types(self): return self.args[1:] @@ -477,11 +482,13 @@ for func_name in ['In', 'NotIn']: [Boolean, Timestamp, Timestamp, Timestamp]]) for comparator in ['GreaterThan', 'LessThan']: create_func(comparator, signatures=[ + [Boolean, Char, Char], [Boolean, Number, Number], [Boolean, Timestamp, Timestamp]]) for comparator in ['GreaterThanOrEquals', 'LessThanOrEquals']: # Avoid equality comparison on FLOATs create_func(comparator, signatures=[ + [Boolean, Char, Char], [Boolean, Decimal, Decimal], [Boolean, Decimal, Int], [Boolean, Int, Decimal], http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e0fb432b/tests/comparison/query_generator.py ---------------------------------------------------------------------- diff --git a/tests/comparison/query_generator.py b/tests/comparison/query_generator.py index 834b8de..69bb40b 100644 --- a/tests/comparison/query_generator.py +++ b/tests/comparison/query_generator.py @@ -1465,33 +1465,44 @@ class QueryGenerator(object): return root_func, relational_funcs -if __name__ == '__main__': +def generate_queries_for_manual_inspection(): '''Generate some queries for manual inspection. The query won't run anywhere because the tables used are fake. To make real queries, we'd need to connect to a database and read the table metadata and such. ''' + from model_translator import SqlWriter + + NUM_TABLES = 5 + NUM_COLS_EACH_TYPE = 10 + NUM_QUERIES = 3000 + tables = list() data_types = list(TYPES) data_types.remove(Float) - for table_idx in xrange(5): + for table_idx in xrange(NUM_TABLES): table = Table('table_%s' % table_idx) tables.append(table) cols = table.cols - for col_idx in xrange(3): - col_type = choice(data_types) - col = Column(table, '%s_col_%s' % (col_type.__name__.lower(), col_idx), col_type) - cols.append(col) + col_idx = 0 + for _ in xrange(NUM_COLS_EACH_TYPE): + for col_type in data_types: + col = Column(table, '%s_col_%s' % (col_type.__name__.lower(), col_idx), col_type) + cols.append(col) + col_idx += 1 table.cols = cols query_profile = DefaultProfile() query_generator = QueryGenerator(query_profile) - from model_translator import SqlWriter sql_writer = SqlWriter.create(dialect='IMPALA') ref_writer = SqlWriter.create(dialect='POSTGRESQL', nulls_order_asc=query_profile.nulls_order_asc()) - for _ in range(3000): + for _ in xrange(NUM_QUERIES): query = query_generator.create_query(tables) print("Test db") print(sql_writer.write_query(query) + '\n') print("Ref db") print(ref_writer.write_query(query) + '\n') + + +if __name__ == '__main__': + generate_queries_for_manual_inspection() http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e0fb432b/tests/comparison/query_profile.py ---------------------------------------------------------------------- diff --git a/tests/comparison/query_profile.py b/tests/comparison/query_profile.py index 223e34e..893188c 100644 --- a/tests/comparison/query_profile.py +++ b/tests/comparison/query_profile.py @@ -25,9 +25,11 @@ from db_types import ( Timestamp) from funcs import ( And, + Coalesce, Equals, GreaterThan, GreaterThanOrEquals, + If, In, IsDistinctFrom, IsNotDistinctFrom, @@ -78,18 +80,40 @@ class DefaultProfile(object): Int: 10, Timestamp: 1}, 'RELATIONAL_FUNCS': { + # The weights below are "best effort" suggestions. Because QueryGenerator + # prefers to set column types first, and some functions are "supported" only by + # some types, it means functions can be pruned off from this dictionary, and + # that will shift the probabilities. A quick example if that if a Char column is + # chosen: LessThan may not have a pre-defined signature for Char comparison, so + # LessThan shouldn't be chosen with Char columns. The tendency to prune will + # shift as the "funcs" module is adjusted to add/remove signatures. + And: 2, + Coalesce: 2, Equals: 40, GreaterThan: 2, GreaterThanOrEquals: 2, In: 2, + If: 2, IsDistinctFrom: 2, IsNotDistinctFrom: 1, IsNotDistinctFromOp: 1, LessThan: 2, LessThanOrEquals: 2, NotEquals: 2, - NotIn: 2}, + NotIn: 2, + Or: 2, + }, 'CONJUNCT_DISJUNCTS': { + # And and Or appear both under RELATIONAL_FUNCS and CONJUNCT_DISJUNCTS for the + # following reasons: + # 1. And and Or are considered "relational" by virtue of taking two arguments + # and returning a Boolean. The crude signature selection means they could be + # selected, so we describe weights there. + # 2. They are set here explicitly as well so that + # QueryGenerator._create_bool_func_tree() can create a "more realistic" + # expression that has a Boolean operator at the top of the tree by explicitly + # asking for an And or Or. + # IMPALA-3896 tracks a better way to do this. And: 5, Or: 1}, 'ANALYTIC_WINDOW': { @@ -226,14 +250,14 @@ class DefaultProfile(object): lower, upper = bounds return randint(lower, upper) - def _choose_from_weights(self, *weights): + def _choose_from_weights(self, *weight_args): '''Returns a value that is selected from the keys of weights with the probability determined by the values of weights. ''' - if isinstance(weights[0], str): - weights = self.weights(*weights) + if isinstance(weight_args[0], str): + weights = self.weights(*weight_args) else: - weights = weights[0] + weights = weight_args[0] total_weight = sum(weights.itervalues()) numeric_choice = randint(1, total_weight) for choice_, weight in weights.iteritems(): @@ -430,20 +454,19 @@ class DefaultProfile(object): ''' if not signatures: raise Exception('At least one signature is required') - signatures = filter( + filtered_signatures = filter( lambda s: s.return_type == Boolean \ and len(s.args) > 1 \ and not any(a.is_subquery for a in s.args), signatures) - if not signatures: + if not filtered_signatures: raise Exception( 'None of the provided signatures corresponded to a relational function') func_weights = self.weights('RELATIONAL_FUNCS') - missing_funcs = set(s.func for s in signatures) - set(func_weights) + missing_funcs = set(s.func for s in filtered_signatures) - set(func_weights) if missing_funcs: - raise Exception("Weights are missing for functions: %s" - % ", ".join([missing_funcs])) - return self.choose_func_signature(signatures, self.weights('RELATIONAL_FUNCS')) + raise Exception("Weights are missing for functions: {0}".format(missing_funcs)) + return self.choose_func_signature(filtered_signatures, self.weights('RELATIONAL_FUNCS')) def choose_func_signature(self, signatures, _func_weights=None): '''Return a signature chosen from "signatures".''' @@ -453,7 +476,11 @@ class DefaultProfile(object): type_weights = self.weights('TYPES') func_weights = _func_weights - if not func_weights: + if func_weights: + distinct_funcs_in_signatures = set([s.func for s in signatures]) + pruned_func_weights = {f: func_weights[f] for f in distinct_funcs_in_signatures} + func_weights = pruned_func_weights + else: # First a function will be chosen then a signature. This is done so that the number # of signatures a function has doesn't influence its likelihood of being chosen. # Functions will be weighted based on the weight of the types in their arguments.