phoenix-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maryann Xue (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost
Date Mon, 12 Feb 2018 21:19:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361456#comment-16361456
] 

Maryann Xue commented on PHOENIX-1556:
--------------------------------------

bq. One thing with PageFilter is that it represents the limit pushed down to the server. Since
the limit cannot always be pushed down (depending on the query - for example an aggregate
query can push down the limit only if it's aggregating on the leading part of the pk), should
we consider that? Or do you think we can reliably get the limit that's pushed to the server
from the query plan?

Thank you for this good reminder! Looks like PageFilter is used in two occasions (let me know
if that's not correct):
1. Set a per scan limit, in which case the total number of bytes scanned will remain the same,
and the cost should be about the same.
2. Push down a limit to the server side for a ScanPlan. The "limit" itself should be taken
into account when calculating cost and byte/row numbers. For example, an order-by with limit
and without limit actually cost differently. For a ScanPlan, whether it is able to "stop"
early on the server side or it depends on the a client-side iterator to stop the scan should
not make a difference in decision making at this point.
That said, your question just reminded me that there's still a little to be done regarding
reflecting "limit" in costs:
1. Scan without order-by but with limit is not costed accurately.
2. Limit pushed down to server-side in a hash-join is not reflected in the cost yet.

There's also another interesting optimization we can do:
For outer joins, we can push limit down to the "preserved" side. We've already done that for
hash-joins, while for sort-merge-joins, it is slightly more complicated as we'd like to push
it further down into the child plan.

I'll open new JIRAs for these improvement tasks. Thanks again for the nice input, [~jamestaylor]!

> Base hash versus sort merge join decision on cost
> -------------------------------------------------
>
>                 Key: PHOENIX-1556
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1556
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: Maryann Xue
>            Priority: Major
>              Labels: CostBasedOptimization
>         Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be scanned for
the RHS table. We should, by default, base the decision of using the hash-join verus many-to-many
join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables being
joined are already ordered by the join key. In that case, it's better to always use the sort
merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message