phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "James Taylor (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-3689) Not determinist order by with limit
Date Wed, 01 Mar 2017 17:30:45 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890621#comment-15890621
] 

James Taylor commented on PHOENIX-3689:
---------------------------------------

What version of Phoenix and HBase are you using? If not 4.9.0, would you mind trying that
to see if the issue persists? Can you also include the explain plan by doing this:
{code}
explain select dt from TT group by dt order by dt  desc limit 1;
{code}
If you can repro through a standalone test case, that would be ideal. If you can repro it,
it'd definitely be a serious bug.

> Not determinist order by with limit
> -----------------------------------
>
>                 Key: PHOENIX-3689
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3689
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.7.0
>            Reporter: Arthur
>
> The following request does not return the last value of table TT:
> select * from TT order by dt desc limit 1;
> Adding a 'group by dt' clause gets back the good result.
> I noticed that an order by with 'limit 10' returns a merge of 10 results from each region
and not 10 results of the whole request.
> So 'order by' is not determinist. It is a bug or a feature ?
> Here is my DDL:
> {code}
> CREATE TABLE TT (dt timestamp NOT NULL, message bigint NOT NULL, id varchar(20) NOT NULL,
version varchar CONSTRAINT PK PRIMARY KEY (dt, message, id));
> {code}
> The issue occurs with a lot of data. I think the 'order by' clause is done by region
and not for the whole result, so limit 1 returns the first region that answers and phoenix
cache it. With only one region, this does not occur.
> This script generate enough data to throw the issue:
> {code}
> #!/usr/bin/python
> import string
> from datetime import datetime, timedelta
> dt = datetime(2017, 1, 1, 3)
> with open('data.csv', 'w') as file:
>         for i in range(0, 10000000):
>                 newdt = dt + timedelta(microseconds=i*10000)
>                 file.write("{};{};{};\n".format(datetime.strftime(newdt, "%Y-%m-%d %H:%M:%S.%f"),
91 if i  % 10  == 0 else 100, str(i).zfill(20)))
> {code}
> With this data set, the last data is : 2017-01-02 06:46:39.990000
> Result with order by clause is not the last value:
> {noformat}
> select dt from TT order by dt desc limit 1;
> +--------------------------+
> |            DT            |
> +--------------------------+
> | 2017-01-01 07:54:40.730  |
> {noformat}
> Correct result is given when using group by, but I need to get all columns.
> {noformat}
> select dt from TT group by dt order by dt  desc limit 1;
> +--------------------------+
> |            DT            |
> +--------------------------+
> | 2017-01-02 06:46:39.990  |
> +--------------------------+
> {noformat}
> I use a subquery as a workaroud, but performance are not good.
> {noformat}
> select * from TT where dt = ANY(select dt from TT group by dt order by dt desc limit
1);
> +--------------------------+----------+-----------------------+----------+
> |            DT            | MESSAGE  |          ID           | VERSION  |
> +--------------------------+----------+-----------------------+----------+
> | 2017-01-02 06:46:39.990  | 100      | 00000000000009999999  |          |
> +--------------------------+----------+-----------------------+----------+
> 1 row selected (8.393 seconds)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message