phoenix-dev mailing list archives

From "James Taylor (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
Date Thu, 22 Mar 2018 03:21:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408984#comment-16408984
] 

James Taylor edited comment on PHOENIX-4666 at 3/22/18 3:20 AM:
----------------------------------------------------------------

I like the simplicity of your design, [~ortutay], in using the hash as the cache ID. You could
just assume that the cache is already available and, if it isn't, react to the exception you
get back by generating it. That way you'd need no mapping at all (and no central place to
check whether the cache ID maps to an existing cache). That flow is already there, but you'd
need to add the logic to generate the cache, since the current code assumes the cache has
already been built (i.e. the existing flow handles the situation in which a region splits and
the new region server doesn't have the cache yet).
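
To make the "assume it's there, build it on a miss" flow concrete, here is a rough sketch in
plain Java. It is not Phoenix code: the class, the CacheMissingException type, the in-memory
map standing in for the region server cache, and the choice of SHA-256 for the hash are all
assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class OptimisticSubqueryCache {
    /** Hypothetical stand-in for the error a server returns when it lacks the cache. */
    static class CacheMissingException extends RuntimeException {
        CacheMissingException(String cacheId) {
            super("no cache for id " + cacheId);
        }
    }

    // Stand-in for the server-side cache (assumption: a simple in-memory map).
    private final Map<String, List<String>> serverCache = new ConcurrentHashMap<>();

    // Counts how often the subquery was actually executed (for illustration only).
    int buildCount = 0;

    /** Derives the cache ID from the subquery text alone, so no ID mapping is needed. */
    static String cacheId(String subquerySql) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(subquerySql.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Simulates the server side: fails if the cache for this ID is not present. */
    private List<String> probe(String id) {
        List<String> rows = serverCache.get(id);
        if (rows == null) {
            throw new CacheMissingException(id);
        }
        return rows;
    }

    /** Assumes the cache exists; on a miss, runs the subquery, stores it, and retries. */
    List<String> joinWithCache(String subquerySql, Supplier<List<String>> runSubquery) {
        String id = cacheId(subquerySql);
        try {
            return probe(id);
        } catch (CacheMissingException e) {
            buildCount++;
            serverCache.put(id, runSubquery.get());
            return probe(id);
        }
    }
}
```

The point of the sketch is that the client never asks "does this cache exist?" up front. The
hash of the subquery is the only key, and a miss is handled as an exception on the normal path.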

Some considerations:
- at a minimum, we could have a global config for the TTL of the cache when this feature is
enabled (so that it'd be a different config than the standard TTL config).
- at the finest granularity, you could even create a new hint that specifies the TTL so you
could specify it per query.
- we'd want to make it clear in the documentation that the cached data can become stale once
generated (until the TTL expires it).
- might consider having a new table-level property with which this feature could be enabled
(or a table-specific TTL could be specified).
- might consider in the future using a format like Apache Arrow to represent the hash join
cache data.
- might consider off-heap memory for the hash join cache.
- persistent cache could be future work (or you could put interfaces in place that could be
replaced).
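
As a rough illustration of how the TTL options above could fit together, here is a small Java
sketch. The precedence order (per-query hint, then table-level property, then global config)
is an assumption on my part, as are the class and method names; none of this is an existing
Phoenix API.

```java
import java.util.OptionalLong;

class CacheTtlResolver {
    /**
     * Picks the effective TTL in milliseconds. Assumed precedence, most specific first:
     * per-query hint, then table-level property, then the global config value.
     */
    static long resolveTtlMs(OptionalLong hintTtlMs, OptionalLong tableTtlMs, long globalTtlMs) {
        if (hintTtlMs.isPresent()) {
            return hintTtlMs.getAsLong();
        }
        if (tableTtlMs.isPresent()) {
            return tableTtlMs.getAsLong();
        }
        return globalTtlMs;
    }

    /** A cache entry is treated as expired once its age reaches the TTL. */
    static boolean isExpired(long createdAtMs, long ttlMs, long nowMs) {
        return nowMs - createdAtMs >= ttlMs;
    }
}
```

Until isExpired returns true for an entry, readers would keep seeing the data as it was when
the cache was generated, which is the staleness the documentation would need to call out.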


> Add a subquery cache that persists beyond the life of a query
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4666
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4666
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Marcell Ortutay
>            Priority: Major
>
> The user list thread for additional context is here: [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> ----
> A Phoenix query may contain expensive subqueries, and moreover those expensive subqueries
may be used across multiple different queries. While whole-result caching is possible at the
application level, it is not possible to cache subresults in the application. This can cause
poor performance for queries in which the subquery is the most expensive part of the query,
and the application is powerless to do anything at the query level. It would be good if Phoenix
provided a way to cache subquery results, as it would provide a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_2 FROM large_table WHERE x = 10) expensive_result
ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = {id}
> In this case, the subquery "expensive_result" is expensive to compute, but it doesn't
change between queries. The rest of the query does because of the {id} parameter. This means
the application can't cache the whole result, but it would be good if there were a way to
cache expensive_result.
> Note that there is currently a coprocessor-based "server cache", but the data in this
"cache" is not persisted across queries. It is deleted after a TTL expires (30 seconds by
default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to provide
a patch with some guidance from Phoenix maintainers. We are currently putting together a design
document for a solution, and we'll post it to this Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
