impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Brown (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4467: Add support for DML statements in stress test
Date Wed, 07 Dec 2016 01:39:33 GMT
Michael Brown has posted comments on this change.

Change subject: IMPALA-4467: Add support for DML statements in stress test

Patch Set 2:

Commit Message:

Line 8: 
It would help to be more expansive in the commit message here. How would someone run DML in
the stress test? Do you want to show some useful usage? What assumptions have you made? Etc.

PS2, Line 11: - Update impyla version in order to be able to have access to query
            :   error text for DML queries.
Did you regression test the Impyla update? We have system tests in Kudu that use it. I can
say I smoke tested Impyla 0.14.0 on the random query generator and data generator and it seems

PS2, Line 13: - Made flake8 fixes. flake8 on this file is clean.
It looks great. Thank you!
File tests/stress/

PS2, Line 279:     self._select_probability = 0.5
Nit: I think this should default to None, because the interface to run_queries() requires
for this to be set anyway. I think setting this to None would find bugs where we missed setting
it to a valid value. With defaulting to 0.5, you hide that.

PS2, Line 331:        queries have completed. 'select_probabilty' 
1. spelling: probability

2. Please mention valid values/types for select_probability

3. Please document verify_results

PS2, Line 725:     # set_up_sql accoplishes this task.
spelling: accomplishes

PS2, Line 735:     # If we run this query on a table in initial state, the table remains unchanged
             :     # this is False. (For example running a query like
             :     # "upsert into lineitem select * from lineitem_original" leaves lineitem
unmodified if
             :     # it is in original state.)
             :     self.modifies_table = False
Be a little more explicit here. This claim presumes lineitem's data was already copied into
lineitem_original, right? Out of that context, this is confusing as is the below where Insert
and Upsert have modifies_table set to False.

PS2, Line 740:     # Type of query. Can have the following values: SELECT, COMPUTE_STATS,
             :     # UPSERT, DELETE.
             :     self.query_type = 'SELECT'
Non-blocking suggestion: instead of raw strings create a class that has these as enumerated

class QueryType(object)
  SELECT, COMPUTE_STATS, ... = xrange(6)

The reason for this is that if you mistype one of the identifiers, Python will fail with a
NameError and find your bug. If you mistype a raw string, Python will just run the string
comparison anyway. It lets bugs be found more directly

PS2, Line 794:         if run_set_up and query.set_up_sql:
If only one of these is true, is that a programming error? Do we need to assert if that's
the case?

PS2, Line 1061:   # TODO:
It would be good if TODOs were bound to Jiras. Can you file one, please?

PS2, Line 1411: _orginal
spelling: original

PS2, Line 1417: set( for table in tables):
What is the purpose of creating a set here?

PS2, Line 1427:   """Generate insert, upsert, update, delete DML statements.
              :      For each table in the database that cursor is connected to, create several
              :      one for each mod value in 'dml_mod_values'. This value controls which
rows will be
              :      affected. The generated queries assume that for each table in the database,
              :      exists a table with a '_original' suffix that has unmodified, for example,
tpch data.
              :   """
Non-blocking commment. TL;DR: Move L1429-L1432 3 spaces to the left.


I'm not a fan of this form of docstring for OCD aesthetic reasons, and I see high amount of
it in tests/comparison, so it's on my mind lately. I went to look at PEP-257 and saw that
the valid multiline forms are:

  """first line
  [required empty line that Gerrit does not preserve]
  other lines


  first line
  [required empty line that Gerrit does not preserve]
  other lines

The reason this is so is that the way object.__doc__ gets rendered depends on line number.
For all lines in a docstring except the first, leading space relative to the indentation is

So the way it's done in much of our Python code base is wrong:

  >>> def function():
  ...   """This is my docstring
  ...      ....on multiple lines
  ...   """
  ...   pass
  >>> print function.__doc__
  This is my docstring

       ....on multiple lines


PS2, Line 1434:   tables = [cursor.describe_table(t) for t in cursor.list_table_names()]
Save an indent level after L1437 and change this expression to only include describe(t) if

PS2, Line 1443: numberical
just say integer, since decimals etc. are also numerical

PS2, Line 1449:         insert_query.modifies_table = False
Add a comment explaining why this is False. It seems to me an INSERT absolutely can modify
a table.

PS2, Line 1464:             for col in table.cols if not col.is_primary_key
You could use table.updatable_column_names instead.

PS2, Line 1466:             "UPDATE a SET {3} FROM {0} a JOIN {0}_original b "
              :             "ON a.{1} = b.{1} + 1 WHERE a.{1} % {2} = 0").format(
              :       ,, mod_value, update_list)
fairly complex format string. Please use keys in places of numbers.  You can do "{key}".format(key=value)

PS2, Line 1471:         upsert_query.modifies_table = False
Add a comment explaining why this is False. It seems to me an UPSERT absolutely can modify
a table.

PS2, Line 1505:   return result
Should you assert that len(result) > 0 ? The continues on L1439 and L1445, plus the input
data, could result in a case where no DML statements get added. It would be a shame to silently
run the test. Presumably the caller wanted to run DML statements.

PS2, Line 1541:       search_pattern = r"\.(.*) \("
It would be safer to include "CREATE TABLE" in this pattern.

PS2, Line 1613:   parser.add_argument(
              :       "--mem-limit-padding-abs", type=int, default=0,
              :       help="Pad query mem limits found by solo execution with this value (in
              :       " running concurrently. After padding queries will not be expected to
              :       " due to mem limit exceeded.")
Please document a scenario in which someone would want to use this.

PS2, Line 1623:       "--reset-database"
Nit; this resets all databases, so it should be "--reset-databases"

PS2, Line 1628:   parser.add_argument(
              :       "--dml-mod-values", nargs="+", type=int, default=[11],
              :       help="List of mod values to use for the DML queries.")
Please provide guidance on what this means and how someone would use it.

PS2, Line 1811:   if args.reset_database:
              :     for database in set(query.db_name for query in queries):
              :       with impala.cursor(db_name=database) as cursor:
              :         reset_database(cursor)
Won't this reset *all* databases, not just ones on which you want to run DML against?

PS2, Line 1898:   if args.reset_database:
              :     for database in set(query.db_name for query in queries):
              :       with impala.cursor(db_name=database) as cursor:
              :         reset_database(cursor)
Can you add a comment describing why this is called again?

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia2aafdc6851cc0e1677a3c668d3350e47c4bfe40
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Matthew Jacobs <>
Gerrit-Reviewer: Michael Brown <>
Gerrit-Reviewer: Taras Bobrovytsky <>
Gerrit-HasComments: Yes

View raw message