phoenix-dev mailing list archives

From "Marcell Ortutay (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
Date Fri, 23 Mar 2018 23:02:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412221#comment-16412221
] 

Marcell Ortutay edited comment on PHOENIX-4666 at 3/23/18 11:01 PM:
--------------------------------------------------------------------

Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly simple, and
it can be expanded in follow-up patches. To start, here is what I would propose:
 # Use existing server cache, with option to keep around data past a single query. There would
be a new "keep around TTL" that sets the max time an entry is kept around. The inter-query
data may be evicted if space is needed. Data being used for a "live" query is tracked as such,
and is never evicted (keep current Exception behavior)
 # Subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only activated
if this hint is present. This hint also takes an optional cache key suffix, e.g.: /*+ SUBQUERY_CACHE('2018-03-23')
*/ which can be used by the application to explicitly expire a cache, in case TTL does not
give enough control
 # Cache eviction uses some sort of priority queue / LRU type system. Simple ranking could
be Rank = # of Cache Hits in Last X minutes / Size of the Entry
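
To make the three points above concrete, here is a minimal sketch in Java. It is not Phoenix code: the class SubqueryCacheSketch and all of its methods are hypothetical names invented for illustration, the cache key format (subquery text plus hint suffix) is an assumption, and time is passed in explicitly to keep the example deterministic. It also returns false when nothing can be evicted, whereas the proposal keeps the current Exception behavior.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; none of these names exist in Phoenix.
class SubqueryCacheSketch {
    static final class Entry {
        final long sizeBytes;
        final long expiresAtMs;                 // deadline from the "keep around TTL"
        int liveQueries;                        // > 0 means a running query is using it
        final Deque<Long> hitTimesMs = new ArrayDeque<>();
        Entry(long sizeBytes, long expiresAtMs) {
            this.sizeBytes = Math.max(1, sizeBytes); // avoid division by zero in rank()
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long capacityBytes, ttlMs, windowMs;
    private long usedBytes;

    SubqueryCacheSketch(long capacityBytes, long ttlMs, long windowMs) {
        this.capacityBytes = capacityBytes;
        this.ttlMs = ttlMs;
        this.windowMs = windowMs;
    }

    // Assumed key format: a new hint suffix yields a new key, so the old entry is never hit again.
    static String cacheKey(String subquerySql, String hintSuffix) {
        return hintSuffix == null ? subquerySql : subquerySql + "|" + hintSuffix;
    }

    // Rank = number of cache hits in the last windowMs / size of the entry.
    private double rank(Entry e, long nowMs) {
        while (!e.hitTimesMs.isEmpty() && e.hitTimesMs.peekFirst() <= nowMs - windowMs) {
            e.hitTimesMs.pollFirst();           // drop hits that fell out of the window
        }
        return (double) e.hitTimesMs.size() / e.sizeBytes;
    }

    // Lowest-ranked entry that no live query is using; expired entries go first.
    private String pickVictim(long nowMs) {
        String victim = null;
        double worst = Double.MAX_VALUE;
        for (Map.Entry<String, Entry> me : entries.entrySet()) {
            Entry e = me.getValue();
            if (e.liveQueries > 0) continue;    // live data is never evicted
            double r = e.expiresAtMs <= nowMs ? -1.0 : rank(e, nowMs);
            if (r < worst) { worst = r; victim = me.getKey(); }
        }
        return victim;
    }

    // Stores an entry, evicting idle entries if space is needed. Returns false if it cannot fit.
    boolean put(String key, long sizeBytes, long nowMs) {
        if (entries.containsKey(key)) return true;
        if (sizeBytes > capacityBytes) return false;
        while (usedBytes + sizeBytes > capacityBytes) {
            String victim = pickVictim(nowMs);
            if (victim == null) return false;   // everything left is pinned by live queries
            usedBytes -= entries.remove(victim).sizeBytes;
        }
        Entry e = new Entry(sizeBytes, nowMs + ttlMs);
        entries.put(key, e);
        usedBytes += e.sizeBytes;
        return true;
    }

    // Records a hit and pins the entry for a live query. Returns false on a miss.
    boolean acquire(String key, long nowMs) {
        Entry e = entries.get(key);
        if (e == null) return false;
        if (e.expiresAtMs <= nowMs && e.liveQueries == 0) {
            usedBytes -= entries.remove(key).sizeBytes; // TTL passed and nobody is using it
            return false;
        }
        e.hitTimesMs.addLast(nowMs);
        e.liveQueries++;
        return true;
    }

    // Unpins the entry once the query that acquired it has finished.
    void release(String key) {
        Entry e = entries.get(key);
        if (e != null && e.liveQueries > 0) e.liveQueries--;
    }

    boolean contains(String key) { return entries.containsKey(key); }
}
```

A query would call acquire() before using a cached subquery result and release() when it finishes, so that eviction only ever considers idle inter-query data. The linear scan in pickVictim() stands in for the priority queue mentioned above; it is simpler to read but would need replacing for a large number of entries.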

Things that will be left for future work:
 # Additional config/control around when to use the subquery cache, e.g. a global control, a
table-level control, or table-timestamp-based controls
 # Use of Apache Arrow for serialization (instead of existing HashCacheClient.serialize()
method)
 # Persistent cache separate from HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch by end of the week
for initial review


> Add a subquery cache that persists beyond the life of a query
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4666
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4666
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Marcell Ortutay
>            Assignee: Marcell Ortutay
>            Priority: Major
>
> The user list thread for additional context is here: [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> ----
> A Phoenix query may contain expensive subqueries, and moreover those expensive subqueries
may be used across multiple different queries. While whole result caching is possible at the
application level, it is not possible to cache subresults in the application. This can cause
bad performance for queries in which the subquery is the most expensive part of the query,
and the application is powerless to do anything at the query level. It would be good if Phoenix
provided a way to cache subquery results, as it would provide a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_2 FROM large_table WHERE x = 10) expensive_result
ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = {id}
> In this case, the subquery "expensive_result" is expensive to compute, but it doesn't
change between queries. The rest of the query does because of the {id} parameter. This means
the application can't cache it, but it would be good if there were a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data in this
"cache" is not persisted across queries. It is deleted after a TTL expires (30sec by default),
or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to provide
a patch with some guidance from Phoenix maintainers. We are currently putting together a design
document for a solution, and we'll post it to this Jira ticket for review in a few days.



