drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-5142) TestWindowFrame.testUnboundedFollowing relies on side effects
Date Wed, 21 Dec 2016 06:15:58 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15766250#comment-15766250
] 

Paul Rogers edited comment on DRILL-5142 at 12/21/16 6:15 AM:
--------------------------------------------------------------

Tracked down the fundamental problem to Hadoop's QuickSort algorithm used by the external
sort: {{org.apache.hadoop.util.QuickSort}}. QuickSort itself is unstable: it does not guarantee
that items that compare equal keep their relative order.
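
To make the instability concrete, here is a minimal Python sketch. It is a textbook
Lomuto-partition quicksort, not Hadoop's {{org.apache.hadoop.util.QuickSort}}, and the
rows and tags are made up for illustration. Only the first field is compared, so the
second field shows what happens to rows that compare equal.

```python
def quicksort(rows, lo=0, hi=None):
    """In-place Lomuto quicksort that compares only rows[i][0] (the sort key)."""
    if hi is None:
        hi = len(rows) - 1
    if lo >= hi:
        return
    pivot_key = rows[hi][0]
    i = lo - 1
    for j in range(lo, hi):
        if rows[j][0] <= pivot_key:
            i += 1
            rows[i], rows[j] = rows[j], rows[i]
    # Move the pivot into its final slot; this long-distance swap is what
    # can reorder rows whose keys are equal.
    rows[i + 1], rows[hi] = rows[hi], rows[i + 1]
    quicksort(rows, lo, i)
    quicksort(rows, i + 2, hi)


# Hypothetical rows: (sort key, tag recording the original arrival order).
original = [(1, "a"), (1, "b"), (0, "c")]

unstable = list(original)
quicksort(unstable)

# Python's built-in sort is documented as stable, so it serves as the reference.
stable = sorted(original, key=lambda row: row[0])

print(unstable)  # [(0, 'c'), (1, 'b'), (1, 'a')]
print(stable)    # [(0, 'c'), (1, 'a'), (1, 'b')]
```

Both results are correctly sorted by key; they differ only in the order of the two rows
with key 1, which is exactly the kind of difference a test must not depend on.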

The unit tests were forcing the issue by reducing the output batch size of the first sort,
which made the second sort merge multiple batches; the merge is apparently stable.
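
One way a merge can end up stable is sketched below in Python. This is a hypothetical
illustration, not Drill's merge code: it assumes each incoming batch is already sorted,
and on equal keys it always takes the row from the earlier batch, so ties come out in
batch order.

```python
def stable_merge(batches):
    """Merge sorted batches of (key, tag) rows; ties go to the earlier batch."""
    pos = [0] * len(batches)  # next unread row in each batch
    out = []
    while True:
        best = None
        for b, batch in enumerate(batches):
            if pos[b] >= len(batch):
                continue  # this batch is exhausted
            # Strict '<' means a later batch wins only with a smaller key.
            if best is None or batch[pos[b]][0] < batches[best][pos[best]][0]:
                best = b
        if best is None:
            return out
        out.append(batches[best][pos[best]])
        pos[best] += 1


# Hypothetical batches, each already sorted by key.
batches = [[(1, "a"), (2, "b")], [(1, "c"), (2, "d")]]
merged = stable_merge(batches)
print(merged)  # [(1, 'a'), (1, 'c'), (2, 'b'), (2, 'd')]
```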

Since neither SQL nor Drill makes guarantees about the stability of sort, we can say that
the unstable sort is not a bug, just a fact of life. Altering the queries is the correct outcome
in this case.

The new, managed external sort does away with the option to force small output batches (as
doing so causes errors when using Union Vectors). So, we will not preserve the implementation
artifact in the managed sort but will fix the tests to be correct instead.


> TestWindowFrame.testUnboundedFollowing relies on side effects
> -------------------------------------------------------------
>
>                 Key: DRILL-5142
>                 URL: https://issues.apache.org/jira/browse/DRILL-5142
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> The unit test {{TestWindowFrame.testUnboundedFollowing}} is one of a family of tests
that does the same query in two different ways, using the results of the second to verify
the first. Unfortunately, this particular test "works" only because it relies on undefined
implementation artifacts about the way the "verification" query is run in Drill.
> Here is the query under test:
> {code}
> SELECT 
>   position_id,
>   employee_id,
>   LAST_VALUE(employee_id)
>     OVER(PARTITION BY position_id
>          ORDER BY employee_id
>          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `last_value`
> FROM
>   dfs_test.`%s/window/b4.p4`
> {code}
> With expected results as follows:
> {code}
> 1,0,9
> ...
> 1,9,9
> {code}
> Here is the "expected results" query:
> {code}
> SELECT
>   position_id,
>   employee_id,
>   MAX(employee_id) OVER(PARTITION BY position_id) AS `last_value`
> FROM (
>   SELECT *
>   FROM dfs_test.`%s/window/b4.p4`
>   ORDER BY position_id, employee_id
> )
> {code}
> The above happens to produce the correct results only because the query executes in a
single fragment. The query produces correct results with the "unmanaged" external sort, but
produces the following (valid) results with the managed external sort:
> {code}
> 1,0,9
> 1,2,9
> ...
> 1,9,9
> 1,1,9
> {code}
> The query relies on the inner query sort order "showing through" to the outer query.
But, if the query were distributed, the outer query would be unordered. Hence, the verification
query just happened to work, but is not actually valid.
> The proper solution is to modify the verification query to move the ORDER BY to the outer
query:
> {code}
> ...
> FROM (
>   SELECT *
>   FROM dfs_test.`%s/window/b4.p4`
> )
>   ORDER BY position_id, employee_id
> {code}
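
The effect of moving the ORDER BY can be sketched in Python. This is a rough emulation
under stated assumptions, not Drill's execution model: the function name and the row data
are hypothetical, and the "arrival order" list stands in for whatever order the outer
query happens to receive rows in.

```python
def max_over_partition(rows):
    """Emulate MAX(employee_id) OVER (PARTITION BY position_id).

    Output rows keep the order of the input rows; no ordering is imposed.
    """
    best = {}
    for position_id, employee_id in rows:
        if position_id not in best or employee_id > best[position_id]:
            best[position_id] = employee_id
    return [(p, e, best[p]) for p, e in rows]


# Hypothetical (position_id, employee_id) rows in an arbitrary arrival order.
arrival_order = [(1, 0), (1, 2), (1, 9), (1, 1)]

# Without an outer ORDER BY, the output simply mirrors the arrival order.
unordered = max_over_partition(arrival_order)

# With the ORDER BY on the outer query, the output order is well defined.
ordered = sorted(unordered, key=lambda row: (row[0], row[1]))

print(unordered)  # [(1, 0, 9), (1, 2, 9), (1, 9, 9), (1, 1, 9)]
print(ordered)    # [(1, 0, 9), (1, 1, 9), (1, 2, 9), (1, 9, 9)]
```

The computed `last_value` column is the same either way; only the row order differs,
which is why the final sort belongs on the outermost query.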


