phoenix-dev mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (PHOENIX-2940) Remove STATS RPCs from rowlock
Date Fri, 17 Jun 2016 16:54:05 GMT


Josh Elser commented on PHOENIX-2940:

bq. +1 to add ConnectionQueryServices.invalidateStats(tableName). I still don't completely
understand why we need that code, though, given that the underlying cache will fetch the stats
if they're not cached. When do we "fail to acquire stats"? Is it for when we collect stats
synchronously (which is really a test-only case)? If that's the case, then how about just
invalidating the stats before making the server-side call to update them?

Testing would definitely benefit from such a method. I'm thinking we could fail to acquire
stats for transient "unhealthy" HBase reasons (e.g. the SYSTEM.STATS region(s) not being online),
which would cache an EMPTY_STATS instance. It would correct itself after {{phoenix.stats.updateFrequency}}
(15 mins), though. At the moment, I can't think of a reason a user would need to ask us to
invalidate the stats before that timeout hits.
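For illustration, the failure-caching behavior described above might look roughly like the sketch below. All class and method names here ({{StatsCacheSketch}}, {{Loader}}, etc.) are hypothetical, not Phoenix's actual API; the point is only the shape of the proposed {{invalidateStats(tableName)}}:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch (illustrative names, not Phoenix's real classes) of a stats
 * cache that stores an EMPTY sentinel when the underlying fetch fails, and
 * re-fetches once the entry is older than the update frequency
 * (phoenix.stats.updateFrequency).
 */
public class StatsCacheSketch {
    public static final Object EMPTY_STATS = new Object();

    private final long updateFrequencyMs;
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    private static final class Entry {
        final Object stats;
        final long fetchedAt;
        Entry(Object stats, long fetchedAt) {
            this.stats = stats;
            this.fetchedAt = fetchedAt;
        }
    }

    public interface Loader {
        Object load() throws Exception;
    }

    public StatsCacheSketch(long updateFrequencyMs) {
        this.updateFrequencyMs = updateFrequencyMs;
    }

    public Object get(String tableName, Loader loader) {
        Entry e = cache.get(tableName);
        long now = System.currentTimeMillis();
        if (e != null && now - e.fetchedAt < updateFrequencyMs) {
            return e.stats; // may be EMPTY_STATS from an earlier failed fetch
        }
        Object stats;
        try {
            stats = loader.load();
        } catch (Exception ex) {
            // Transient HBase problem (e.g. SYSTEM.STATS region offline):
            // cache the empty sentinel until the timeout elapses.
            stats = EMPTY_STATS;
        }
        cache.put(tableName, new Entry(stats, now));
        return stats;
    }

    // The proposed invalidateStats(tableName): drop the entry so the next
    // get() re-fetches immediately instead of waiting out the timeout.
    public void invalidateStats(String tableName) {
        cache.remove(tableName);
    }
}
```

With this shape, a test that deliberately fails the first fetch would otherwise be stuck with EMPTY_STATS for the full update interval; invalidating lets it re-fetch immediately.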

I have a little test infrastructure with JMeter to do some concurrency testing on the read
side. I will do some quick comparisons and aim to commit later today.

> Remove STATS RPCs from rowlock
> ------------------------------
>                 Key: PHOENIX-2940
>                 URL:
>             Project: Phoenix
>          Issue Type: Improvement
>         Environment: HDP 2.3 + Apache Phoenix 4.6.0
>            Reporter: Nick Dimiduk
>            Assignee: Josh Elser
>             Fix For: 4.8.0
>         Attachments: PHOENIX-2940.001.patch, PHOENIX-2940.002.patch, PHOENIX-2940.003.patch
> We have an unfortunate situation wherein we potentially execute many RPCs while holding
> a row lock. This problem is discussed in detail on the user list thread ["Write path blocked
> by MetaDataEndpoint acquiring region lock"|].
> In some situations, the [MetaDataEndpoint|]
> coprocessor will attempt to refresh its view of the schema definitions and statistics. This
> involves [taking a rowlock|],
> executing a scan against the [local region|],
> and then a scan against a [potentially remote|]
> statistics table.
> This issue is apparently exacerbated by the use of user-provided timestamps (in my case,
> via the ROW_TIMESTAMP feature, or perhaps as in PHOENIX-2607). When combined with other
> issues (PHOENIX-2939), we end up with total gridlock in our handler threads -- everyone queued
> behind the rowlock, scanning and rescanning SYSTEM.STATS. Because this happens in the MetaDataEndpoint,
> the means by which all clients refresh their knowledge of schema, gridlock in that RS can
> effectively stop all forward progress on the cluster.
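The fix direction implied by the issue summary is to move the remote stats scan out of the rowlock's critical section. The contrast can be sketched as follows; this is purely illustrative (the {{RemoteCall}} interface and method names are invented for the sketch, not Phoenix code):

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Illustrative only: contrast performing remote work while holding a row
 * lock vs. performing it before taking the lock. Names are invented.
 */
public class RowLockSketch {
    private final ReentrantLock rowLock = new ReentrantLock();

    interface RemoteCall { Object call(); }

    // Anti-pattern: every handler queues behind the lock while one thread
    // performs a potentially slow remote scan (e.g. against SYSTEM.STATS).
    Object buildTableUnderLock(RemoteCall schemaScan, RemoteCall statsScan) {
        rowLock.lock();
        try {
            Object schema = schemaScan.call();
            Object stats = statsScan.call(); // remote RPC under the lock
            return combine(schema, stats);
        } finally {
            rowLock.unlock();
        }
    }

    // Improvement: fetch stats (idempotent and cacheable) outside the lock,
    // keeping only the local schema read in the critical section.
    Object buildTableOutsideLock(RemoteCall schemaScan, RemoteCall statsScan) {
        Object stats = statsScan.call(); // RPC before taking the lock
        rowLock.lock();
        try {
            Object schema = schemaScan.call();
            return combine(schema, stats);
        } finally {
            rowLock.unlock();
        }
    }

    private Object combine(Object schema, Object stats) {
        return schema + "+" + stats;
    }
}
```

The second shape bounds the time each handler spends holding the lock to the local region read, so a slow or offline SYSTEM.STATS region delays only the caller, not every thread queued on the rowlock.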

This message was sent by Atlassian JIRA
