phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joel Palmert (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-3438) Phoenix should be able to optimize LIMIT query with IN clause
Date Thu, 10 Nov 2016 08:41:58 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15653449#comment-15653449
] 

Joel Palmert commented on PHOENIX-3438:
---------------------------------------

Ye we're not really varying org_id, thats a simplification. 

The reason I was thinking the optimizer should be able to do this is that the row order is
being preserved within a given org, and the way the query is executed it accesses the rows
for each org separately (it's driving from the org id array) as far as I can tell (SKIP SCAN
ON 2 KEYS from the plan).

I don't see how the the column order in the index you propose is going to work. If org_id
comes after score and entity_id we will have to scan all orgs, rather than only the orgs specified
in the IN clause.

> Phoenix should be able to optimize LIMIT query with IN clause
> -------------------------------------------------------------
>
>                 Key: PHOENIX-3438
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3438
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.8.0
>            Reporter: Joel Palmert
>            Assignee: James Taylor
>            Priority: Critical
>             Fix For: 4.8.3
>
>
> This is taken from a real production issue we're having but has been simplified here.
Steps to repro:
> CREATE TABLE IF NOT EXISTS TEST.TEST (
>     ORGANIZATION_ID CHAR(15) NOT NULL,
>     SCORE INTEGER NOT NULL,
>     ENTITY_ID CHAR(15)
>     CONSTRAINT TEST_PK PRIMARY KEY (
>         ORGANIZATION_ID,
>         SCORE DESC
>     )
> ) VERSIONS=1, MULTI_TENANT=TRUE, REPLICATION_SCOPE=1, TTL=43200
> EXPLAIN
> SELECT entity_id, MAX(score) max_score
> FROM test.test
> WHERE organization_id IN ('org1','org3')
> GROUP BY entity_id
> ORDER BY max_score DESC
> LIMIT 1
> CLIENT 1-CHUNK PARALLEL 1-WAY SKIP SCAN ON 2 KEYS OVER TEST.TEST ['org1           ']
- ['org2           ']
>     SERVER AGGREGATE INTO DISTINCT ROWS BY [ENTITY_ID]
> CLIENT MERGE SORT
> CLIENT TOP 1 ROW SORTED BY [MAX(SCORE) DESC]
> Notice that Phoenix gets and sends all the rows for the two organizations to the client.
Since the ORDER BY is in PK order I would expect it to just 
> 1. Get the first row for each organization (in index order) rather than scanning all
the rows.
> 2. Merge them (only 2 rows in this case)
> 3. Limit 1
> Consider the following query and plan for what I would expect to see:
> EXPLAIN
> SELECT entity_id, MAX(score) max_score
> FROM(
> 	SELECT entity_id, score
> 	FROM(
> 		SELECT entity_id, score
> 		FROM test.test
> 		WHERE organization_id = 'org1'
> 		ORDER BY score DESC
> 		LIMIT 1)
> 	UNION ALL
> 	SELECT entity_id, score
> 		FROM(
> 		SELECT entity_id, score
> 		FROM test.test
> 		WHERE organization_id = 'org2'
> 		ORDER BY score DESC
> 		LIMIT 1))
> GROUP BY entity_id
> ORDER BY max_score DESC
> LIMIT 1
> UNION ALL OVER 2 QUERIES
>     CLIENT 1-CHUNK SERIAL 1-WAY RANGE SCAN OVER TEST.TEST ['org1           ']
>         SERVER 1 ROW LIMIT
>     CLIENT 1 ROW LIMIT
>     CLIENT 1-CHUNK SERIAL 1-WAY RANGE SCAN OVER TEST.TEST ['org2           ']
>         SERVER 1 ROW LIMIT
>     CLIENT 1 ROW LIMIT
> CLIENT SORTED BY [ENTITY_ID]
> CLIENT AGGREGATE INTO DISTINCT ROWS BY [ENTITY_ID]
> CLIENT TOP 1 ROW SORTED BY [MAX(SCORE) DESC]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message