phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "James Taylor (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PHOENIX-3438) Phoenix should be able to optimize LIMIT query with IN clause
Date Wed, 09 Nov 2016 19:57:58 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651887#comment-15651887
] 

James Taylor edited comment on PHOENIX-3438 at 11/9/16 7:57 PM:
----------------------------------------------------------------

[~jpalmert] - our current optimization for distinct is limited to usage when the order is
preserved for the rows being output. That's not the case here. For example, let's say you
have these rows:
| organization_id | score | entity_id |
| org1 | 50 | e1 |
| org1 | 30 | e5 |
| org1 | 20 | e3 |
| org2 | 90 | e4 |
| org2 | 30 | e2 |

The rows need to be output in this order, which is different than the row key order:
| score | entity_id |
| 90 | e4 |
| 50 | e1 |
| 30 | e5 |
| 30 | e2 |
| 20 | e3 |

If you had an index on SCORE+ENITY_ID+ORGANIZATION_ID it might work, but you wouldn't be able
to do that with your base table being a multi-tenant table (since ORGANIZATION_ID will always
be the first row key column for indexes on multi-tenant tables). But I'm not sure if you're
really varying the ORGANIZATION_ID. It might help if you wrote up an example here with your
CONTAINER_ID instead. In that case if you replace ORGANIZATION_ID above with CONTAINER_ID,
then I think the optimization will be possible. But, I still think there's a missing optimization
for Phoenix, though. I think the limit needs to be pushed to the DistinctPrefixFilter (or
it might work to have a PageFilter *after* the DistinctPrefixFilter). I'll see if that can
be added, as it would not difficult.

FYI, [~lhofhansl].


was (Author: jamestaylor):
[~jpalmert] - our current optimization for distinct is limited to usage when the order is
preserved for the rows being output. That's not the case here. For example, let's say you
have these rows:
| organization_id | score | entity_id |
----------------
| org1 | 50 | e1 |
| org1 | 30 | e5 |
| org1 | 20 | e3 |
| org2 | 90 | e4 |
| org2 | 30 | e2 |
---------------

The rows need to be output in this order, which is different than the row key order:
| score | entity_id |
----------------
| 90 | e4 |
| 50 | e1 |
| 30 | e5 |
| 30 | e2 |
| 20 | e3 |
---------------

If you had an index on SCORE+ENITY_ID+ORGANIZATION_ID it might work, but you wouldn't be able
to do that with your base table being a multi-tenant table (since ORGANIZATION_ID will always
be the first row key column for indexes on multi-tenant tables). But I'm not sure if you're
really varying the ORGANIZATION_ID. It might help if you wrote up an example here with your
CONTAINER_ID instead. In that case if you replace ORGANIZATION_ID above with CONTAINER_ID,
then I think the optimization will be possible. But, I still think there's a missing optimization
for Phoenix, though. I think the limit needs to be pushed to the DistinctPrefixFilter (or
it might work to have a PageFilter *after* the DistinctPrefixFilter). I'll see if that can
be added, as it would not difficult.

FYI, [~lhofhansl].

> Phoenix should be able to optimize LIMIT query with IN clause
> -------------------------------------------------------------
>
>                 Key: PHOENIX-3438
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3438
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.8.0
>            Reporter: Joel Palmert
>            Assignee: James Taylor
>            Priority: Critical
>             Fix For: 4.8.3
>
>
> This is taken from a real production issue we're having but has been simplified here.
Steps to repro:
> CREATE TABLE IF NOT EXISTS TEST.TEST (
>     ORGANIZATION_ID CHAR(15) NOT NULL,
>     SCORE INTEGER NOT NULL,
>     ENTITY_ID CHAR(15)
>     CONSTRAINT TEST_PK PRIMARY KEY (
>         ORGANIZATION_ID,
>         SCORE DESC
>     )
> ) VERSIONS=1, MULTI_TENANT=TRUE, REPLICATION_SCOPE=1, TTL=43200
> EXPLAIN
> SELECT entity_id, MAX(score) max_score
> FROM test.test
> WHERE organization_id IN ('org1','org3')
> GROUP BY entity_id
> ORDER BY max_score DESC
> LIMIT 1
> CLIENT 1-CHUNK PARALLEL 1-WAY SKIP SCAN ON 2 KEYS OVER TEST.TEST ['org1           ']
- ['org2           ']
>     SERVER AGGREGATE INTO DISTINCT ROWS BY [ENTITY_ID]
> CLIENT MERGE SORT
> CLIENT TOP 1 ROW SORTED BY [MAX(SCORE) DESC]
> Notice that Phoenix gets and sends all the rows for the two organizations to the client.
Since the ORDER BY is in PK order I would expect it to just 
> 1. Get the first row for each organization (in index order) rather than scanning all
the rows.
> 2. Merge them (only 2 rows in this case)
> 3. Limit 1
> Consider the following query and plan for what I would expect to see:
> EXPLAIN
> SELECT entity_id, MAX(score) max_score
> FROM(
> 	SELECT entity_id, score
> 	FROM(
> 		SELECT entity_id, score
> 		FROM test.test
> 		WHERE organization_id = 'org1'
> 		ORDER BY score DESC
> 		LIMIT 1)
> 	UNION ALL
> 	SELECT entity_id, score
> 		FROM(
> 		SELECT entity_id, score
> 		FROM test.test
> 		WHERE organization_id = 'org2'
> 		ORDER BY score DESC
> 		LIMIT 1))
> GROUP BY entity_id
> ORDER BY max_score DESC
> LIMIT 1
> UNION ALL OVER 2 QUERIES
>     CLIENT 1-CHUNK SERIAL 1-WAY RANGE SCAN OVER TEST.TEST ['org1           ']
>         SERVER 1 ROW LIMIT
>     CLIENT 1 ROW LIMIT
>     CLIENT 1-CHUNK SERIAL 1-WAY RANGE SCAN OVER TEST.TEST ['org2           ']
>         SERVER 1 ROW LIMIT
>     CLIENT 1 ROW LIMIT
> CLIENT SORTED BY [ENTITY_ID]
> CLIENT AGGREGATE INTO DISTINCT ROWS BY [ENTITY_ID]
> CLIENT TOP 1 ROW SORTED BY [MAX(SCORE) DESC]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message