phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "James Taylor (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-3438) Phoenix should be able to optimize LIMIT query with IN clause
Date Thu, 10 Nov 2016 15:38:58 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654360#comment-15654360
] 

James Taylor commented on PHOENIX-3438:
---------------------------------------

The row order has to be preserved across the entire set of rows for the aggregation to be
done in-place by a single scan. Phoenix won't compile a query into multiple scans except for
its parallelization partitioned along row key based on guideposts (i.e. If there are enough
score+container_id rows to span multiple guideposts) or for union (where you're essentially
telling it which scans to run separately). The trade off will be multiple RPCs plus the dedup
you're stuck doing versus a single scan over potentially more data. If Phoenix did optimize
it into multiple scans the way you envisioned, the execution plan would pretty much look the
same as the union query (minus the deduping).

> Phoenix should be able to optimize LIMIT query with IN clause
> -------------------------------------------------------------
>
>                 Key: PHOENIX-3438
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3438
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.8.0
>            Reporter: Joel Palmert
>            Assignee: James Taylor
>            Priority: Critical
>             Fix For: 4.8.3
>
>
> This is taken from a real production issue we're having but has been simplified here.
Steps to repro:
> CREATE TABLE IF NOT EXISTS TEST.TEST (
>     ORGANIZATION_ID CHAR(15) NOT NULL,
>     SCORE INTEGER NOT NULL,
>     ENTITY_ID CHAR(15)
>     CONSTRAINT TEST_PK PRIMARY KEY (
>         ORGANIZATION_ID,
>         SCORE DESC
>     )
> ) VERSIONS=1, MULTI_TENANT=TRUE, REPLICATION_SCOPE=1, TTL=43200
> EXPLAIN
> SELECT entity_id, MAX(score) max_score
> FROM test.test
> WHERE organization_id IN ('org1','org3')
> GROUP BY entity_id
> ORDER BY max_score DESC
> LIMIT 1
> CLIENT 1-CHUNK PARALLEL 1-WAY SKIP SCAN ON 2 KEYS OVER TEST.TEST ['org1           ']
- ['org2           ']
>     SERVER AGGREGATE INTO DISTINCT ROWS BY [ENTITY_ID]
> CLIENT MERGE SORT
> CLIENT TOP 1 ROW SORTED BY [MAX(SCORE) DESC]
> Notice that Phoenix gets and sends all the rows for the two organizations to the client.
Since the ORDER BY is in PK order I would expect it to just 
> 1. Get the first row for each organization (in index order) rather than scanning all
the rows.
> 2. Merge them (only 2 rows in this case)
> 3. Limit 1
> Consider the following query and plan for what I would expect to see:
> EXPLAIN
> SELECT entity_id, MAX(score) max_score
> FROM(
> 	SELECT entity_id, score
> 	FROM(
> 		SELECT entity_id, score
> 		FROM test.test
> 		WHERE organization_id = 'org1'
> 		ORDER BY score DESC
> 		LIMIT 1)
> 	UNION ALL
> 	SELECT entity_id, score
> 		FROM(
> 		SELECT entity_id, score
> 		FROM test.test
> 		WHERE organization_id = 'org2'
> 		ORDER BY score DESC
> 		LIMIT 1))
> GROUP BY entity_id
> ORDER BY max_score DESC
> LIMIT 1
> UNION ALL OVER 2 QUERIES
>     CLIENT 1-CHUNK SERIAL 1-WAY RANGE SCAN OVER TEST.TEST ['org1           ']
>         SERVER 1 ROW LIMIT
>     CLIENT 1 ROW LIMIT
>     CLIENT 1-CHUNK SERIAL 1-WAY RANGE SCAN OVER TEST.TEST ['org2           ']
>         SERVER 1 ROW LIMIT
>     CLIENT 1 ROW LIMIT
> CLIENT SORTED BY [ENTITY_ID]
> CLIENT AGGREGATE INTO DISTINCT ROWS BY [ENTITY_ID]
> CLIENT TOP 1 ROW SORTED BY [MAX(SCORE) DESC]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message