phoenix-dev mailing list archives

From "Maryann Xue (JIRA)" <>
Subject [jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost
Date Fri, 09 Feb 2018 22:21:00 GMT


Maryann Xue commented on PHOENIX-1556:

{quote}Should UNION_DISTINCT_FACTOR be 1.0 since we only support UNION ALL currently?{quote}
Since we only support "all", this block won't take effect at all, which means the UNION ALL
row count will just be the sum of its children's row counts.
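The row-count rule above can be sketched as follows. This is a minimal illustration, not Phoenix's actual RowCountVisitor code; the class, method names, and the 0.5 discount value are all hypothetical.

```java
import java.util.List;

public class UnionRowCount {
    // Hypothetical discount that would apply to UNION DISTINCT; while only
    // UNION ALL is supported, the distinct branch is never taken.
    static final double UNION_DISTINCT_FACTOR = 0.5;

    static double estimate(List<Double> childRowCounts, boolean distinct) {
        double sum = 0;
        for (double rc : childRowCounts) {
            sum += rc;
        }
        // For UNION ALL the estimate is simply the sum of the children.
        return distinct ? sum * UNION_DISTINCT_FACTOR : sum;
    }

    public static void main(String[] args) {
        System.out.println(estimate(List.of(100.0, 250.0, 50.0), false)); // 400.0
    }
}
```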
{quote}What's the reasoning behind stripSkipScanFilter? Is that removed because its effect
is already incorporated into the bytes scanned estimate?{quote}
Yes. {{stripSkipScanFilter()}} also aims to eliminate things like PageFilter, keeping only
the boolean expression filters that cannot be pushed into the PK.
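The stripping idea can be illustrated with a small sketch: drop the filters whose effect is already reflected in the bytes-scanned estimate (skip-scan, page-limiting) and keep only residual boolean expressions that could not be pushed into the primary key. The enum and method names below are hypothetical, not Phoenix's actual code.

```java
import java.util.List;
import java.util.stream.Collectors;

public class StripFilters {
    // Hypothetical classification of the filters a scan may carry.
    enum FilterKind { SKIP_SCAN, PAGE_FILTER, BOOLEAN_EXPRESSION }

    // Keep only the residual boolean-expression filters; the skip-scan and
    // page filters are already accounted for in the bytes-scanned estimate.
    static List<FilterKind> strip(List<FilterKind> filters) {
        return filters.stream()
                .filter(f -> f == FilterKind.BOOLEAN_EXPRESSION)
                .collect(Collectors.toList());
    }
}
```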
{quote}Should RowCountVisitor have a method for distinct? In particular, there's an optimization
we have when doing a distinct on the leading PK columns which impacts cost. This optimization
is not identified until runtime, so we might need to tweak the code so we know about it at
compile time. This could be done in a separate patch.{quote}
Thank you for pointing this out! I'll open another JIRA and dig into that.
{quote}Somewhat orthogonal to your pull (but maybe building on top of it), do you think it'd
be possible to prevent a query from running that's "too expensive" (assuming "too expensive"
would be identified by a config property)?{quote}
Maybe. But users should be well aware that the costs are not accurate and do not correspond
to any specific amount of time. The absolute value of a cost matters less than the difference
between the costs of alternative plans generated from the same query. Besides, a QueryPlan
can consist of a mix of operators, each of which carries a different weight in cost
evaluation, so it would be hard for users to figure out a proper threshold. A more realistic
approach might be to set a configurable "limit" for specific operators. For example, we know
that some queries time out during sorting if the dataset is too large, so when calculating
the cost for an order-by (or sometimes a client-side order-by), we could flag plans that
exceed the limit. Another example is how we handle hash joins right now: when the build side
is over the limit, we simply treat the plan as too expensive (represented by the "highest" cost).
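The "highest cost" mechanism can be sketched as follows: an over-limit hash join is priced as effectively infinite, so any finite-cost alternative (such as a sort-merge plan) wins the comparison. The class names, limit value, and the toy cost formulas are hypothetical, not Phoenix's actual Cost model.

```java
public class CostCompare {
    // Hypothetical byte limit for the hash-join build side.
    static final double HASH_JOIN_LIMIT_BYTES = 100.0 * 1024 * 1024;

    static double hashJoinCost(double buildSideBytes) {
        if (buildSideBytes > HASH_JOIN_LIMIT_BYTES) {
            // "Too expensive": the highest cost, never chosen over a finite alternative.
            return Double.MAX_VALUE;
        }
        return buildSideBytes; // toy model: cost proportional to bytes cached
    }

    static double sortMergeCost(double lhsBytes, double rhsBytes) {
        return (lhsBytes + rhsBytes) * 1.5; // toy model: pay to sort both sides
    }

    public static void main(String[] args) {
        double rhs = 500.0 * 1024 * 1024; // 500 MB build side, over the limit
        double hash = hashJoinCost(rhs);
        double merge = sortMergeCost(200.0 * 1024 * 1024, rhs);
        System.out.println(hash > merge ? "sort-merge" : "hash"); // sort-merge
    }
}
```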

> Base hash versus sort merge join decision on cost
> -------------------------------------------------
>                 Key: PHOENIX-1556
>                 URL:
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: Maryann Xue
>            Priority: Major
>              Labels: CostBasedOptimization
>         Attachments: PHOENIX-1556.patch
> At compile time, we know how many guideposts (i.e. how many bytes) will be scanned for
> the RHS table. We should, by default, base the decision of using the hash join versus many-to-many
> join on this information.
> Another criterion (as we've seen in PHOENIX-4508) is whether or not the tables being
> joined are already ordered by the join key. In that case, it's better to always use the sort
> merge join.
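The decision rule described in the issue can be sketched as follows: prefer sort-merge whenever both inputs are already ordered on the join key (no sort needed), and otherwise use a hash join only while the RHS size estimated from guideposts fits under a threshold. The class, method names, and parameters are hypothetical, not Phoenix's actual planner API.

```java
public class JoinStrategy {
    enum Strategy { HASH, SORT_MERGE }

    static Strategy choose(long rhsBytesFromGuideposts,
                           long maxHashCacheBytes,
                           boolean bothSidesOrderedOnJoinKey) {
        if (bothSidesOrderedOnJoinKey) {
            // Per PHOENIX-4508: inputs already sorted on the join key, so
            // sort-merge avoids both the sort and the server-side hash cache.
            return Strategy.SORT_MERGE;
        }
        // Otherwise, hash join only if the RHS fits in the hash cache.
        return rhsBytesFromGuideposts <= maxHashCacheBytes
                ? Strategy.HASH
                : Strategy.SORT_MERGE;
    }
}
```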

This message was sent by Atlassian JIRA
