hbase-issues mailing list archives

From "Appy (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-18432) Prevent clock from getting stuck after update()
Date Sat, 22 Jul 2017 03:24:00 GMT

     [ https://issues.apache.org/jira/browse/HBASE-18432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Appy updated HBASE-18432:
    Attachment: HBASE-18432.HBASE-14070.HLC.001.patch

> Prevent clock from getting stuck after update()
> -----------------------------------------------
>                 Key: HBASE-18432
>                 URL: https://issues.apache.org/jira/browse/HBASE-18432
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Appy
>            Assignee: Appy
>         Attachments: HBASE-18432.HBASE-14070.HLC.001.patch
> There were a [bunch of problems|https://issues.apache.org/jira/browse/HBASE-14070?focusedCommentId=16094013&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16094013]
(also copied below) with the clock getting stuck after a call to update() until its own system
time caught up.
> The proposed solution is to keep track of skew separately.
> -----
> PT = physical time, LT = logical time, ST = system time, X = don't care terms
> Note that in the current implementation, we pass the master clock to the RS in open/close
region requests and the RS clock to the master in the responses. Both update their own time
on receiving these requests/responses.
> Also, on receiving a clock ahead of its own, each updates its own clock to the received PT+LT,
and keeps incrementing LT until its own ST catches up to that PT.
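
The update rule described above can be sketched roughly as follows. This is a toy model, not the HBase implementation: the class name `SketchHLC`, its methods, and the injected `system_time` callable are all hypothetical, and the "+1" on receive follows the textbook hybrid logical clock rule rather than anything confirmed by the patch.

```python
class SketchHLC:
    """Toy hybrid logical clock. ST is injected so the example is deterministic."""

    def __init__(self, system_time):
        self.system_time = system_time  # callable returning ST (hypothetical hook)
        self.pt = system_time()         # physical component
        self.lt = 0                     # logical component

    def now(self):
        """Timestamp a local event."""
        st = self.system_time()
        if st > self.pt:
            # ST has caught up: PT moves forward and LT resets.
            self.pt, self.lt = st, 0
        else:
            # ST is still behind PT: the clock is "stuck", only LT advances.
            self.lt += 1
        return (self.pt, self.lt)

    def update(self, remote_pt, remote_lt):
        """Merge a clock received in a request/response."""
        st = self.system_time()
        if st > max(self.pt, remote_pt):
            self.pt, self.lt = st, 0
        elif remote_pt > self.pt:
            # Remote clock is ahead: adopt its PT and continue from its LT.
            self.pt, self.lt = remote_pt, remote_lt + 1
        elif remote_pt == self.pt:
            self.lt = max(self.lt, remote_lt) + 1
        else:
            self.lt += 1
        return (self.pt, self.lt)


# RS with ST = 10 receives master's clock (20, 5).
st = [10]
rs = SketchHLC(lambda: st[0])
print(rs.update(20, 5))  # PT jumps to 20
print(rs.now())          # PT stays at 20, LT grows
st[0] = 21
print(rs.now())          # ST passed 20, so PT follows ST again
```

Between the update and the moment ST passes 20, every event consumes one logical tick, which is the window the problems below are about.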
> ----
> Problem 1: Logical time window too small.
> RS clock (10, X)
> Master clock (20, X)
> Master --request-> RS
> RS clock (20, X)
> While the RS's physical Java clock (which backs the physical component of the HLC clock)
will still take 10 sec to catch up, we'll keep incrementing the logical component. That means that,
in the worst case, our logical clock window should be big enough to support all the events that
can happen in max skew time.
> The problem is that this doesn't seem to be the case. Our logical window is 1M events (20 bits)
and max skew time is 30 sec, which results in 33k max write qps, which is quite low. We can
easily see 150k update qps per beefy server with 1k values.
> Even 22 bits won't be enough. We'll need a minimum of 23 bits and 20 sec max skew time
to support ~420k max events per second in worst case clock skew.
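
The arithmetic behind these numbers can be checked with a few lines. The helper name `max_events_per_sec` is made up for illustration; it simply divides the logical window by the max skew time. Note that the ~33k figure above rounds the window to 1M; with the exact 2^20 it comes out slightly higher.

```python
def max_events_per_sec(logical_bits, max_skew_sec):
    """Events per second the logical window can absorb for a full max-skew period."""
    return (1 << logical_bits) // max_skew_sec


for bits, skew in [(20, 30), (22, 30), (23, 20)]:
    print(f"{bits} bits, {skew} sec skew -> {max_events_per_sec(bits, skew)} events/sec")
```

With 22 bits and 30 sec the result stays under the 150k update qps mentioned above, while 23 bits with 20 sec gives roughly 420k.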
> ----
> Problem 2: Cascading logical time increment.
> Consider a case with more RSs involved, say 3 RSs and 1 master. Let's say max skew is 30 sec.
> HLC Clocks (physical time, logical time): X = don't care
> RS1: (50, 100k)
> Master: (40, X)
> RS2: (30, X)
> RS3: (20, X) 
> [RS3's ST behind RS1's by 30 sec.]
> RS1 replies to master and sends its clock (50, X).
> Master's clock becomes (50, X). It'll be another 10 sec before its own physical clock reaches
50, so HLC's PT will remain 50 for the next 10 sec.
> Master --> RS2
> RS2's clock = (50, X).
> RS2 keeps incrementing LT on writes (since its own PT is behind) for a few seconds before
it replies back to master with (50, X + few 100k).
> Master's clock = (50, X + few 100k) [Since master's physical clock hasn't caught up yet
(note that it was 10 seconds behind), PT remains 50.]
> Master --> RS3
> RS3's clock (50, X+few 100k) 
> But RS3's ST is behind RS1's ST by 30 sec, which means it'll keep incrementing LT for
the next 30 sec (unless it gets a newer clock from master).
> But the problem is that RS3 now has a much smaller LT window than the actual 1M!
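
A rough trace of this cascade, using plain (PT, LT) tuples, is sketched below. The helpers `receive` and `write` are hypothetical simplifications (a receiver just adopts the larger clock), and the 300,000 writes on RS2 are an assumed stand-in for "few 100k".

```python
LOGICAL_BITS = 20
WINDOW = 1 << LOGICAL_BITS  # logical ticks available per PT value


def receive(local, remote):
    """Adopt the remote (PT, LT) if it is ahead of the local one (simplified)."""
    return max(local, remote)  # tuples compare by PT first, then LT


def write(clock, n):
    """n writes while ST is still behind PT: only LT advances."""
    pt, lt = clock
    return (pt, lt + n)


rs1 = (50, 100_000)
master = receive((40, 0), rs1)   # RS1 replies to master
rs2 = receive((30, 0), master)   # master --> RS2
rs2 = write(rs2, 300_000)        # assumed "few 100k" writes on RS2
master = receive(master, rs2)    # RS2 replies to master
rs3 = receive((20, 0), master)   # master --> RS3

remaining = WINDOW - rs3[1]
print("RS3 clock:", rs3)
print("LT ticks left for RS3:", remaining)
print("sustainable events/sec over 30 sec:", remaining // 30)
```

Under these assumed numbers RS3 starts its 30 sec catch-up with well under the full window, even though it has not served a single write itself.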
> ----
> Problem 3: Single bad RS clock crashing the cluster:
> If a single RS's clock is bad and runs a bit fast, it'll keep pulling master's
PT forward with it. If 'real time' is, say, 20, max skew time is 10, and the bad RS is at time 29.9, it'll
pull master to 29.9 (via its next response), and then any RS below 19.9, i.e. just 0.1 sec
behind real time, will die for exceeding max skew.
> This can bring whole clusters down!
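
The example above can be replayed with a small sketch. Times are in milliseconds to avoid floating-point noise. The function names and the two healthy RSs (`rs_a`, `rs_b`) are invented for illustration; the skew check is assumed to be "master PT minus RS ST greater than max skew", as the description implies.

```python
MAX_SKEW_MS = 10_000


def master_after_response(master_pt_ms, rs_pt_ms):
    """Master adopts the RS's PT if it is ahead of its own."""
    return max(master_pt_ms, rs_pt_ms)


def rejected(master_pt_ms, rs_st_ms, max_skew_ms=MAX_SKEW_MS):
    """True if the RS is further behind master than max skew allows."""
    return master_pt_ms - rs_st_ms > max_skew_ms


real_time = 20_000
master = real_time
healthy = {"rs_a": 19_800, "rs_b": 19_950}  # hypothetical RSs close to real time

before = {name: rejected(master, st) for name, st in healthy.items()}
master = master_after_response(master, 29_900)  # response from the bad RS
after = {name: rejected(master, st) for name, st in healthy.items()}

print("before bad RS:", before)
print("after bad RS: ", after)
```

Here `rs_a`, only 0.2 sec behind real time, gets rejected once master has been pulled to 29.9, while `rs_b` at 0.05 sec behind survives.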
> ----
> Problem 4: Time jumps (not a bug, but more of a nuisance)
> Say an RS is behind master by 20 sec. On each communication from master, the RS will update
its own PT to master's PT, and it'll stay there until the RS's ST catches up. If there is frequent
communication from master, ST might never catch up and the RS's PT will actually look like discrete
time jumps rather than continuous time.
> For example, if master communicated with the RS at times 30, 40, 50 (the RS's corresponding times are
10, 20, 30), then all events on the RS between time [10, 50] will be timestamped with either 30,
40 or 50.
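
The same example as a small simulation: one event per second of RS system time from 10 to 50, with master's clock arriving at RS times 10, 20 and 30. The function `stamped_pts` is a hypothetical helper that only models the PT component.

```python
def stamped_pts(messages, event_times):
    """Return the HLC PT assigned to each event.

    messages maps RS system time -> master PT received at that moment.
    """
    pt = 0
    out = []
    for st in event_times:
        if st in messages:
            pt = max(pt, messages[st])  # adopt master's PT if it is ahead
        pt = max(pt, st)                # PT never falls behind the RS's own ST
        out.append(pt)
    return out


pts = stamped_pts({10: 30, 20: 40, 30: 50}, range(10, 51))
print(sorted(set(pts)))
```

Every event in that 40 sec span lands on one of three PT values, which is the discrete-jump behaviour described above.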
> ----

This message was sent by Atlassian JIRA
