hbase-issues mailing list archives

From "Feng Honghua (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10679) Both clients get wrong scan results if the first scanner expires and the second scanner is created with the same scannerId on the same region
Date Sat, 08 Mar 2014 07:05:45 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13924756#comment-13924756
] 

Feng Honghua commented on HBASE-10679:
--------------------------------------

bq.When the AtomicLong hits the max, it goes negative which should be fine since we are toString
the value. It then goes down all the ways around the zero so all should be good. Nice one
Honghua. I know its only a few lines but probably took a lot longer than that to figure it
out.
# Yes. The final fix is pretty straightforward, but the scenario and conditions that trigger
the bug are quite tricky and not easy to work out.
# The scannerId is returned to the client as a long rather than a string for use in subsequent
scan requests, and -1 is treated as an invalid scannerId, so a negative scannerId isn't acceptable/desirable.
But a back-of-the-envelope calculation goes like this: there are 2^63 = 9223372036854775808
non-negative long values; the scannerId is per regionserver instance and doesn't span
different regionserver process lifecycles; 1000 years = 1000 * 365 * 24 * 60 * 60 = 31536000000
seconds; scannerIds are consumed fastest when all requests are reads/scans, so the
read/scan QPS would have to be 9223372036854775808 / 31536000000 = 292471208 for the scannerId to reach
the max and then go negative. Since it's almost impossible for a regionserver process to
live for 1000 years without downtime, and 292471208 is also far too high a read/scan QPS
for a regionserver to serve, we can safely overlook the possibility of the scannerId going negative.
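
The arithmetic above can be checked with a small sketch. The class and method names below are hypothetical and not part of HBase; the sketch only assumes that ids are handed out from a java.util.concurrent.atomic.AtomicLong counter, one id per scan, as the quoted comment suggests.

```java
import java.math.BigInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration only; not HBase code.
class ScannerIdBudget {
    static final long SECONDS_PER_YEAR = 365L * 24 * 60 * 60;

    // Scans per second needed to use up all 2^63 non-negative long values
    // within the given number of years (integer division, rounded down).
    static BigInteger qpsToExhaust(long years) {
        BigInteger ids = BigInteger.ONE.shiftLeft(63); // 2^63 does not fit in a long
        BigInteger seconds =
            BigInteger.valueOf(years).multiply(BigInteger.valueOf(SECONDS_PER_YEAR));
        return ids.divide(seconds);
    }

    // What an AtomicLong counter returns when incremented past Long.MAX_VALUE.
    static long nextAfterMax() {
        AtomicLong counter = new AtomicLong(Long.MAX_VALUE);
        return counter.incrementAndGet(); // wraps around to Long.MIN_VALUE
    }

    public static void main(String[] args) {
        System.out.println(qpsToExhaust(1000));               // 292471208
        System.out.println(nextAfterMax() == Long.MIN_VALUE); // true
    }
}
```

The wrap-around goes straight from the largest positive value to the most negative one, which is why the required QPS, rather than the wrap-around itself, is what makes the negative case safe to ignore.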

bq.Same test failed twice in a row. Want to take a looksee...The tests make output. You can
navigate some if you click on the above links. You might see something in the output that
you don't see locally
OK, I'll check. Thanks for the reminder.

> Both clients get wrong scan results if the first scanner expires and the second scanner
is created with the same scannerId on the same region
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10679
>                 URL: https://issues.apache.org/jira/browse/HBASE-10679
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>            Reporter: Feng Honghua
>            Assignee: Feng Honghua
>            Priority: Critical
>         Attachments: HBASE-10679-trunk_v1.patch, HBASE-10679-trunk_v2.patch, HBASE-10679-trunk_v2.patch
>
>
> The scenario is as follows (both Client A and Client B scan Region R):
> # A opens a scanner SA on R; the scannerId is N. It successfully gets its first row "a".
> # SA's lease expires and it is removed from scanners.
> # B opens a scanner SB on R; the scannerId is N too. It successfully gets its first row
"m".
> # A issues its second scan request with scannerId N. The regionserver finds that N is a valid scannerId
and that the region matches too (since the region stays online on this regionserver and both
scanners are against it), so it executes the scan request on SB and returns "n" to A -- wrong!
(A gets data from another scanner; it expects a row such as "b" that follows "a".)
> # B issues its second scan request with scannerId N. The regionserver also considers it valid,
executes the scan on SB, and returns "o" to B -- wrong! (It should return "n", but "n" has already
been scanned out by A.)
> The consequence is that both clients get wrong scan results:
> # A gets data from a scanner created by another client; its own scanner has expired and been removed.
> # B misses data that it should have received but that was wrongly scanned out by A.
> The root cause is that a scannerId generated by the regionserver isn't guaranteed to be unique within
the regionserver's whole lifecycle; *the only guarantee is that scannerIds of scanners that
are currently still valid (not expired) are unique*, so the same scannerId can appear in scanners
again after an earlier scanner with that scannerId expires and is removed from scanners.
If the second scanner is against the same region, the bug arises.
> In theory, the above scenario should be very rare (two consecutive
scans on the same region from two different clients get the same scannerId, and the first expires
before the second is created), but it can happen, and once it does, the consequence
is severe (all clients involved get wrong data) and extremely hard to diagnose/debug.
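
The scenario described above can be replayed with a minimal sketch. Everything here is hypothetical: Registry stands in for the regionserver's map of open scanners, the constant id source simply forces the id collision, and the AtomicLong source illustrates a counter that never repeats within one process. It is not the actual HBase code or patch, and it assumes Java 9+ for List.of.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical simulation for illustration only; not HBase code.
class ScannerIdReuseDemo {
    // Stand-in for a regionserver's table of currently open scanners.
    static final class Registry {
        private final Map<Long, Iterator<String>> scanners = new ConcurrentHashMap<>();
        private final LongSupplier idSource;

        Registry(LongSupplier idSource) { this.idSource = idSource; }

        long open(List<String> rows) {
            long id = idSource.getAsLong();
            scanners.put(id, rows.iterator());
            return id;
        }

        void expire(long id) { scanners.remove(id); }

        // Returns the next row, or null if the id is unknown or exhausted.
        String next(long id) {
            Iterator<String> it = scanners.get(id);
            return (it != null && it.hasNext()) ? it.next() : null;
        }
    }

    // Replays steps 1-4 and returns what A's second request yields.
    static String secondRowSeenByA(LongSupplier idSource) {
        Registry r = new Registry(idSource);
        long sa = r.open(List.of("a", "b", "c"));
        r.next(sa);                               // A reads "a"
        r.expire(sa);                             // SA's lease expires
        long sb = r.open(List.of("m", "n", "o"));
        r.next(sb);                               // B reads "m"
        return r.next(sa);                        // A retries with its old id
    }

    public static void main(String[] args) {
        // Forced id reuse: A is served from SB and steals "n".
        System.out.println(secondRowSeenByA(() -> 7L)); // n
        // Ids from a monotonically increasing counter: A's stale id is unknown.
        AtomicLong counter = new AtomicLong();
        System.out.println(secondRowSeenByA(counter::incrementAndGet)); // null
    }
}
```

With ids that never repeat, A's stale request fails to find a scanner instead of silently reading from SB, which is the behavior a client can detect and recover from.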



--
This message was sent by Atlassian JIRA
(v6.2#6252)
