impala-dev mailing list archives

From Jim Apple <jbap...@cloudera.com>
Subject New Impala contributors: IMPALA-5440
Date Wed, 20 Sep 2017 04:55:24 GMT
If you'd like to contribute a patch to Impala, but aren't sure what you
want to work on, you can look at Impala's newbie issues:
https://issues.apache.org/jira/issues/?filter=12341668. You can find
detailed instructions on submitting patches at
https://cwiki.apache.org/confluence/display/IMPALA/Contributing+to+Impala.
This is a walkthrough of a ticket a new contributor could take on, with
(hopefully) enough detail to get you going, but not so much as to take away
the fun.

How can we fix https://issues.apache.org/jira/browse/IMPALA-5440, "Add
planner tests with extreme statistics values"? The comments on the ticket
address a number of ways, some of them rather ambitious for a new
contributor, so let's talk about a smaller chunk of it.

This ticket was filed in response to
https://issues.apache.org/jira/browse/IMPALA-5282, which reported an
exception in the frontend (the component that parses, analyzes, and plans
queries) caused by an overflow. Take a look at the patch that fixed the issue,
https://gerrit.cloudera.org/#/c/7084. It doesn't include any new tests,
which is why IMPALA-5440 was filed. You can see this in the comments on the
patch: "For now, I feel pretty good about the computePerHostResources()
with respect to overflow since I read all the code carefully. We should
still have tests to not break it sometime later. I filed IMPALA-5440 to
address the long-standing bug in test coverage."

Reading the comments on a patch is a good way to understand why something
in Impala is the way it is. All recent Impala patches have a line at the
bottom of the commit message with the URL of the code review, so you can do
archaeology for information that wasn't included in the patch itself. All
code review comments are also sent to
https://lists.apache.org/list.html?reviews@impala.apache.org, which you can
subscribe to in the same way you subscribed to this list, by mailing
reviews-subscribe@impala.incubator.apache.org.

In this case, the question to address is arithmetic overflow in the
frontend. The previous patch shows many places where overflow is checked,
and you may be able to add new tests for each line in that patch. For now,
let's just work on two categories of overflow: cardinality estimation and
memory estimates.

To execute a query efficiently, Impala's planner estimates the number of
rows that will be produced by different parts of the query. If the
arithmetic behind a cardinality estimate overflows, the planner can end up
estimating a negative number of rows!
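
To make the failure mode concrete, here is a minimal, self-contained Java
sketch. It is not Impala code: the class and method names are hypothetical,
and the row count assumes a table with 6,001,215 rows (lineitem at TPC-H
scale factor 1). It multiplies row counts with plain long arithmetic, the
way an unchecked cross-join estimate might.

```java
// Hypothetical demo, not Impala code: shows how unchecked long arithmetic
// can turn a large row-count estimate negative.
class NaiveCardinalityDemo {
  // Assumed row count: lineitem at TPC-H scale factor 1.
  static final long LINEITEM_ROWS = 6_001_215L;

  // Naive cross-join estimate: multiply the input cardinalities with no
  // overflow check. Java long arithmetic silently wraps around on overflow.
  static long naiveCrossJoinCardinality(long rows, int numTables) {
    long result = 1;
    for (int i = 0; i < numTables; i++) {
      result *= rows;
    }
    return result;
  }

  public static void main(String[] args) {
    for (int n = 1; n <= 3; n++) {
      System.out.println(
          n + " table(s): " + naiveCrossJoinCardinality(LINEITEM_ROWS, n));
    }
  }
}
```

With these assumed inputs, the true product for three tables is larger than
2^63 - 1, so the stored value wraps around and comes out negative.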

To see if you can get arithmetic overflow, start up impala-shell.sh and set
explain_level=2. This will show the planner's estimates of the number of
rows each part of a query produces. Then explain the plans for some cross
joins:

use tpch;
explain select * from lineitem a;
explain select * from lineitem a, lineitem b;
explain select * from lineitem a, lineitem b, lineitem c;
...

At some point in that sequence, you will see that the cardinality estimate
reaches a ceiling, even though those queries would actually produce more
and more rows with each cross-join. This is because the overflow check is
working and capping the cardinality estimate at the largest long value,
2^63 - 1.
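
The capping behavior can be sketched as a saturating multiply. This is an
illustration only: saturatingMultiply and crossJoinCardinality are
hypothetical helpers, not the functions Impala's planner uses, and the row
count is again an assumed 6,001,215. The sketch relies on the standard
Math.multiplyExact, which throws ArithmeticException on long overflow.

```java
// Hypothetical sketch, not Impala's implementation: multiply non-negative
// cardinalities and cap the result at Long.MAX_VALUE instead of wrapping.
class SaturatingCardinalityDemo {
  static long saturatingMultiply(long a, long b) {
    if (a < 0 || b < 0) {
      throw new IllegalArgumentException("cardinalities must be non-negative");
    }
    try {
      // Math.multiplyExact throws ArithmeticException on long overflow.
      return Math.multiplyExact(a, b);
    } catch (ArithmeticException e) {
      return Long.MAX_VALUE; // 2^63 - 1
    }
  }

  // Estimated rows from cross-joining numTables copies of a table.
  static long crossJoinCardinality(long rows, int numTables) {
    long result = 1;
    for (int i = 0; i < numTables; i++) {
      result = saturatingMultiply(result, rows);
    }
    return result;
  }

  public static void main(String[] args) {
    long rows = 6_001_215L; // assumed: lineitem at TPC-H scale factor 1
    for (int n = 1; n <= 4; n++) {
      System.out.println(n + " table(s): " + crossJoinCardinality(rows, n));
    }
  }
}
```

In this sketch the estimate stops growing once it hits Long.MAX_VALUE, and
every further join leaves it there. The point at which the real planner
reaches its ceiling depends on its own statistics and may differ.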

To see how to test this, take a look at
fe/src/test/java/org/apache/impala/planner/PlannerTest.java. Each of the
tests in that file references a file in
testdata/workloads/functional-planner/queries/PlannerTest/. To look for a
test that can check that cardinality is bounded, search for the string
"cardinality" in the PlannerTest directory. Find the corresponding test
method in PlannerTest.java, then write a similar test file and test
method.
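
The property such a test protects can also be stated as a standalone check,
independent of Impala's test framework. The sketch below is hypothetical
throughout: it assumes the plan text contains tokens of the form
cardinality=<integer>, and the sample plan strings are made up for
illustration rather than copied from real explain output. It parses with
BigInteger so that an out-of-range value is reported instead of failing to
parse.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Impala: checks that every cardinality
// found in a plan's text lies in the range [0, Long.MAX_VALUE].
class PlanCardinalityCheck {
  // Assumed token format: "cardinality=" followed by an integer.
  private static final Pattern CARDINALITY =
      Pattern.compile("cardinality=(-?\\d+)");

  static List<BigInteger> extractCardinalities(String planText) {
    List<BigInteger> result = new ArrayList<>();
    Matcher m = CARDINALITY.matcher(planText);
    while (m.find()) {
      result.add(new BigInteger(m.group(1)));
    }
    return result;
  }

  static boolean allBounded(String planText) {
    BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
    for (BigInteger c : extractCardinalities(planText)) {
      if (c.signum() < 0 || c.compareTo(max) > 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Made-up plan fragments, for illustration only.
    String capped = "01:SCAN cardinality=6001215\n"
        + "02:JOIN cardinality=9223372036854775807\n";
    String wrapped = "02:JOIN cardinality=-42\n";
    System.out.println(allBounded(capped));
    System.out.println(allBounded(wrapped));
  }
}
```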

Have fun, and don't hesitate to ask on dev@impala.apache.org if you get
stuck and need help!
