hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-8089) Ordering is lost when limit is put in outer query
Date Thu, 18 Sep 2014 01:51:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138379#comment-14138379
] 

Sergey Shelukhin edited comment on HIVE-8089 at 9/18/14 1:51 AM:
-----------------------------------------------------------------

I double checked on Tez; even though reducesink sends data in select order of the subquery,
it arrives at the select of the next stage in different order.
Heavily reformatted logs from logging added to ReduceSink, FileSink and Select - time, container
id, part of log line (number for select is just Java object id):
{noformat}
18:36:57,719,1411004090043_0003_01_000003,1807797632 selecting 0, val_0, 
18:36:57,730,1411004090043_0003_01_000003,Reducesinking 0, val_0, 
18:36:57,751,1411004090043_0003_01_000003,1807797632 selecting 0, val_0, 
18:36:57,763,1411004090043_0003_01_000003,Reducesinking 0, val_0, 
18:36:57,784,1411004090043_0003_01_000003,1807797632 selecting 0, val_0, 
18:36:57,795,1411004090043_0003_01_000003,Reducesinking 0, val_0, 
18:36:57,818,1411004090043_0003_01_000003,1807797632 selecting 10, val_10, 
18:36:57,830,1411004090043_0003_01_000003,Reducesinking 10, val_10, 
18:36:57,852,1411004090043_0003_01_000003,1807797632 selecting 100, val_100, 
18:36:57,863,1411004090043_0003_01_000003,Reducesinking 100, val_100, 


18:36:58,096,1411004090043_0003_01_000002,107130125 selecting 100, val_100, 
18:36:58,107,1411004090043_0003_01_000002,1473175752 selecting 100, val_100, 
18:36:58,147,1411004090043_0003_01_000002,Filesinking 100, val_100, 
18:36:58,159,1411004090043_0003_01_000002,107130125 selecting 10, val_10, 
18:36:58,169,1411004090043_0003_01_000002,1473175752 selecting 10, val_10, 
18:36:58,180,1411004090043_0003_01_000002,Filesinking 10, val_10, 
18:36:58,191,1411004090043_0003_01_000002,107130125 selecting 0, val_0, 
18:36:58,202,1411004090043_0003_01_000002,1473175752 selecting 0, val_0, 
18:36:58,213,1411004090043_0003_01_000002,Filesinking 0, val_0, 
18:36:58,224,1411004090043_0003_01_000002,107130125 selecting 0, val_0, 
18:36:58,235,1411004090043_0003_01_000002,1473175752 selecting 0, val_0, 
18:36:58,246,1411004090043_0003_01_000002,Filesinking 0, val_0, 
18:36:58,258,1411004090043_0003_01_000002,107130125 selecting 0, val_0, 
18:36:58,269,1411004090043_0003_01_000002,1473175752 selecting 0, val_0, 
18:36:58,280,1411004090043_0003_01_000002,Filesinking 0, val_0, 
{noformat}

It's good that this came up in q files because on the cluster, with multiple containers potentially
in each stage, ordering will not be possible to enforce.

The issue happens just because stage is added, not because of limit specifically.
So, if order is needed, it needs to be enforced... in current hive contract that doesn't happen.

If CBO moves limits around in such manner it (or Hive) also needs to propagate order by so
it is enforced between stages.
We could do a hacky patch to prevent limit from adding a stage in this case, but I suspect
other cases can also happen...


was (Author: sershe):
I double checked on Tez; even though reducesink sends data in select order of the subquery,
it arrives at the select of the next stage in different order.
Heavily reformatted logs from logging added to ReduceSink, FileSink and Select - container
id, time, part of log line (number for select is just Java object id):
{noformat}
18:36:57,719,1411004090043_0003_01_000003,1807797632 selecting 0, val_0, 
18:36:57,730,1411004090043_0003_01_000003,Reducesinking 0, val_0, 
18:36:57,751,1411004090043_0003_01_000003,1807797632 selecting 0, val_0, 
18:36:57,763,1411004090043_0003_01_000003,Reducesinking 0, val_0, 
18:36:57,784,1411004090043_0003_01_000003,1807797632 selecting 0, val_0, 
18:36:57,795,1411004090043_0003_01_000003,Reducesinking 0, val_0, 
18:36:57,818,1411004090043_0003_01_000003,1807797632 selecting 10, val_10, 
18:36:57,830,1411004090043_0003_01_000003,Reducesinking 10, val_10, 
18:36:57,852,1411004090043_0003_01_000003,1807797632 selecting 100, val_100, 
18:36:57,863,1411004090043_0003_01_000003,Reducesinking 100, val_100, 


18:36:58,096,1411004090043_0003_01_000002,107130125 selecting 100, val_100, 
18:36:58,107,1411004090043_0003_01_000002,1473175752 selecting 100, val_100, 
18:36:58,147,1411004090043_0003_01_000002,Filesinking 100, val_100, 
18:36:58,159,1411004090043_0003_01_000002,107130125 selecting 10, val_10, 
18:36:58,169,1411004090043_0003_01_000002,1473175752 selecting 10, val_10, 
18:36:58,180,1411004090043_0003_01_000002,Filesinking 10, val_10, 
18:36:58,191,1411004090043_0003_01_000002,107130125 selecting 0, val_0, 
18:36:58,202,1411004090043_0003_01_000002,1473175752 selecting 0, val_0, 
18:36:58,213,1411004090043_0003_01_000002,Filesinking 0, val_0, 
18:36:58,224,1411004090043_0003_01_000002,107130125 selecting 0, val_0, 
18:36:58,235,1411004090043_0003_01_000002,1473175752 selecting 0, val_0, 
18:36:58,246,1411004090043_0003_01_000002,Filesinking 0, val_0, 
18:36:58,258,1411004090043_0003_01_000002,107130125 selecting 0, val_0, 
18:36:58,269,1411004090043_0003_01_000002,1473175752 selecting 0, val_0, 
18:36:58,280,1411004090043_0003_01_000002,Filesinking 0, val_0, 
{noformat}

It's good that this came up in q files because on the cluster, with multiple containers potentially
in each stage, ordering will not be possible to enforce.

The issue happens just because stage is added, not because of limit specifically.
So, if order is needed, it needs to be enforced... in current hive contract that doesn't happen.

If CBO moves limits around in such manner it (or Hive) also needs to propagate order by so
it is enforced between stages.
We could do a hacky patch to prevent limit from adding a stage in this case, but I suspect
other cases can also happen...

> Ordering is lost when limit is put in outer query
> -------------------------------------------------
>
>                 Key: HIVE-8089
>                 URL: https://issues.apache.org/jira/browse/HIVE-8089
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Laljo John Pullokkaran
>            Assignee: Sergey Shelukhin
>
> It seems like hive supports order by, limit in sub queries (compiler doesn't complain).
However ordering seems to be lost based on where you place the limit.   I haven't debugged
the issue.
> ex:
> select key, c_int from (select key, c_int from (select key, c_int from t1 order by c_int
limit 5)t1)t1;
> null	NULL
> null	NULL
> 1	1
> 1	1
> 1	1
> select key, c_int from (select key, c_int from (select key, c_int from t1 order by c_int)t1
limit 5)t1;
> 1	1
> 1	1
> 1	1
> null	NULL
> null	NULL
> select key, c_int from (select key, c_int from (select key, c_int from t1 order by c_int
limit 5)t1 limit 5)t1;
> 1	1
> 1	1
> 1	1
> null	NULL
> null	NULL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message