phoenix-issues mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-5239) Send persistent subquery cache to all regionservers
Date Fri, 03 May 2019 18:06:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832705#comment-16832705 ]

Josh Elser commented on PHOENIX-5239:
-------------------------------------

{quote}The problem is that leads to unpredictability of response time.
{quote}
Gotcha. So, your persistent query is less effective because you still have variance in 'deploying'
the results to the RS.
{quote} 

I would see this as the lesser of two evils when it's anticipated the subquery will end up
being used on most of the regionservers. What's your opinion of adding a config option to
toggle this behavior?
{quote}
I think I lean towards Lars' opinions as well – I don't like it, but I won't veto the change.
A hint which indicates "cache on all servers" (instead of "cache on necessary servers") strikes
me as the best middle-ground. People have to opt-in to it, but it's still usable for you without
much pain. However, I would caution that this does _not_ work for multi-tenant installations.
If you and I are each trying to cache a query on all RS which fill the available cache, we'll
just stomp on each other.
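
To make the hint idea concrete, here is a minimal sketch in plain Java (not Phoenix code) of how the choice of target servers could depend on an opt-in scope. {{Region}}, {{CacheScope}}, and {{selectServers}} are hypothetical names invented for illustration, and row keys are simplified to ints. The real logic lives in ServerCacheClient#addServerCache and works on byte[] key ranges, so treat this only as an outline of the behavior being discussed.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: none of these names are Phoenix or HBase APIs.
class CacheTargetSketch {

    // Hypothetical region descriptor: half-open key range [startKey, endKey) hosted on one server.
    static final class Region {
        final String server;
        final int startKey;
        final int endKey;

        Region(String server, int startKey, int endKey) {
            this.server = server;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // Hypothetical stand-in for the two hint behaviors described above.
    enum CacheScope { NECESSARY_SERVERS, ALL_SERVERS }

    // Returns the servers that should receive the cache for a lookup on lookupKey.
    static Set<String> selectServers(List<Region> regions, int lookupKey, CacheScope scope) {
        Set<String> servers = new LinkedHashSet<>();
        for (Region region : regions) {
            boolean intersects = region.startKey <= lookupKey && lookupKey < region.endKey;
            // Default: only servers whose regions may hold the key.
            // Opt-in: every server hosting a region of the table.
            if (intersects || scope == CacheScope.ALL_SERVERS) {
                servers.add(region.server);
            }
        }
        return servers;
    }

    public static void main(String[] args) {
        List<Region> regions = Arrays.asList(
                new Region("rs1", 0, 100),
                new Region("rs2", 100, 200),
                new Region("rs3", 200, 300));
        System.out.println(selectServers(regions, 150, CacheScope.NECESSARY_SERVERS)); // [rs2]
        System.out.println(selectServers(regions, 150, CacheScope.ALL_SERVERS));       // [rs1, rs2, rs3]
    }
}
```

Keeping the default at "necessary servers" means existing queries behave as before, and only callers who ask for the wider scope pay the memory cost on every server.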

I believe my bigger fear is that this is just one of several features in which you try to make
RegionServers look like a distributed memory caching layer. RegionServers are definitely not
built for such a thing (Java isn't great at keeping large hunks of memory resident). If this
is something you want to use long-term, we may be better off trying to use Redis or Memcached
instead of keeping it in the RegionServer. Having a distributed filesystem behind HBase is
something we can use, although, for large numbers of RegionServers, we might find ourselves in
a case where we have a thundering-herd going to a single DataNode (when the blocks aren't
yet replicated to multiple DataNodes). In short, I think there's likely a better, long-term
architecture choice to find, but it would require some experimentation to see what would be
best (in both simplicity and efficiency) :). Good follow-on thoughts.

> Send persistent subquery cache to all regionservers
> ---------------------------------------------------
>
>                 Key: PHOENIX-5239
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5239
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: John Phillips
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> PHOENIX-4666 introduced a persistent subquery cache that allowed phoenix to cache the
results from an expensive subquery (enabled with a {{USE_PERSISTENT_CACHE}} query hint) to
speed up subsequent queries.
> More context is available on the PHOENIX-4666 ticket, but a quick example would be a
query like:
> {code:sql}
> SELECT /*+ USE_PERSISTENT_CACHE */ *
>     FROM table1
>     JOIN (SELECT id_1 FROM large_table WHERE x = 10) expensive_result
>     ON table1.id_1 = expensive_result.id_1
> WHERE table1.id_1 = [some_id]
> {code}
> Where lots of queries are run, differing only by {{some_id}}. Our usage involves first
running one query over phoenix to warm the cache (which takes ~20 seconds), then, once it completes,
allowing the live queries to run, which use the persistent subquery cache (~100ms).
> However, we noticed that when phoenix sends the cache to the regionservers, it looks
at {{some_id}} in the outer query to figure out which regionservers might contain {{table1.id_1
= [some_id]}} ([code here|https://github.com/apache/phoenix/blob/2084a6c/phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java#L282-L283]).
This means that when we first start running the query, we'll inconsistently hit the cache
until it ends up being propagated to all the regionservers.
> Basically, we'd like to have some way to warm the subquery cache and ensure it's on all
the regionservers so subsequent queries will always find the cache. I think the simplest solution
might be updating the [if statement in ServerCacheClient#addServerCache|https://github.com/apache/phoenix/blob/2084a6c/phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java#L282-L283]
to simply always send the cache to all the regionservers if it's a persistent subquery:
> {code:java}
> - if ( ! servers.contains(entry) &&
> -         keyRanges.intersectRegion(regionStartKey, regionEndKey,
> -                 cacheUsingTable.getIndexType() == IndexType.LOCAL)) {
> + boolean keyRangesIntersect = keyRanges.intersectRegion(regionStartKey, regionEndKey,
> +         cacheUsingTable.getIndexType() == IndexType.LOCAL);
> + if (!servers.contains(entry) && (keyRangesIntersect || usePersistentCache)) {
> {code}
> I tested this out, and it seems to work as expected. If it sounds like an acceptable
solution, I'd be happy to make an actual PR. Or, if anyone has any other suggestions on better
ways to handle this, it would be much appreciated.
> FYI [~jamestaylor], [~elserj], and [~maryannxue] since it looks like you three handled
most of the review on the [original persistent cache PR|https://github.com/apache/phoenix/pull/298]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
