hbase-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-3787) Increment is non-idempotent but client retries RPC
Date Fri, 19 Apr 2013 02:21:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635960#comment-13635960 ]

Sergey Shelukhin edited comment on HBASE-3787 at 4/19/13 2:20 AM:
------------------------------------------------------------------

Well, I've spent today writing some code, and then discussed it with [~enis]. There are
a huge number of special cases all over the place...

First, on which nonces to use. Unfortunately, my proposed approach is not as simple as it
seems, due to special cases :)
In the case of hash nonces:
* you have an epic hashmap of all nonces seen over the past expiration interval (say, one hour)
* on the common successful operation path, every op does a putIfAbsent into this map, and
later takes an uncontested lock to check for waiting conflicts
* cleanup iterates the map and removes expired entries
* the code is simple (a minimal sketch follows these two lists)
* *retries after the expiration time will result in duplicate operations*
* increasing the expiration time is limited by total (map) size
In the case of sequential nonces:
* you have a small map keyed by client, with smaller nonce structures inside; it is easier to
clean up, so probably smaller/much smaller in total
* however, on the common path, you do a small hashmap lookup, a putIfAbsent into a smaller
structure (say, another hashmap), take the same uncontested lock as in case 1, and do 2 interlocked ops
* cleanup is still very simple
* however, to keep cleanup and the main path simple, the code for starting/ending a nonce operation
has to have a lot of interlocked/volatile/etc. cleverness, which will almost never execute,
to handle special cases
* *retries after expiration time #1 will result in rejection - no dup; the importance of this
depends on the WAL design below*
* increasing expiration time #1 is limited by total size
* *retries after expiration time #2 will result in a duplicate change*
* increasing expiration time #2 is limited only by the number of *clients* (not nonces), so it's
a much larger margin of safety (which may or may not be worth it)
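
To make the hash-nonce bookkeeping concrete, here is a minimal sketch, assuming just a
nonce-to-insertion-time map (names are made up for illustration and are not actual HBase code;
the uncontested lock used to coordinate with waiting conflicts is omitted):
{code:java}
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only, not the actual patch: nonce -> insertion time.
public class HashNonceTracker {
  private final ConcurrentHashMap<Long, Long> nonces = new ConcurrentHashMap<>();
  private final long expirationMs;

  public HashNonceTracker(long expirationMs) {
    this.expirationMs = expirationMs;
  }

  /** Common path: putIfAbsent; a non-null previous value means this is a retry/duplicate. */
  public boolean startOperation(long nonce) {
    return nonces.putIfAbsent(nonce, System.currentTimeMillis()) == null;
  }

  /** Cleanup just iterates the map and drops entries older than the expiration interval. */
  public void cleanup() {
    long cutoff = System.currentTimeMillis() - expirationMs;
    nonces.values().removeIf(insertedAt -> insertedAt < cutoff);
  }
}
{code}
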
Finally, to make full use of range collapsing as suggested above, you need to factor the region
into the clientId on the client (not on the server; to the server, clientId is opaque), so that
sequences of consecutive numbers go to the same region.
That would allow using a better structure than a hashmap for the nonces above, reduce the memory
footprint a lot, and increase the nonce expiration time.
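
To illustrate what range collapsing buys, here is a hypothetical per-client structure where
in-order nonces collapse into a single watermark and only stragglers are stored individually
(expiration handling is omitted; again, not the actual code):
{code:java}
import java.util.TreeSet;

// Hypothetical per-client tracker for sequential nonces; not the actual patch.
public class SequentialNonceTracker {
  private long collapsedUpTo = 0;                    // every nonce <= this has been seen
  private final TreeSet<Long> outOfOrder = new TreeSet<>();

  /** Returns true if the nonce is new, false if it is a duplicate/retry. */
  public synchronized boolean startOperation(long nonce) {
    if (nonce <= collapsedUpTo || outOfOrder.contains(nonce)) {
      return false;
    }
    outOfOrder.add(nonce);
    // Collapse the contiguous prefix into the watermark; if the client sends nonces
    // in order (per region, once region is factored into clientId), this set stays tiny.
    while (outOfOrder.contains(collapsedUpTo + 1)) {
      outOfOrder.remove(collapsedUpTo + 1);
      collapsedUpTo++;
    }
    return true;
  }
}
{code}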

h5. TL;DR1: I am not sure the complexity of sequential nonces is worth it, at least for
the first cut.

Then, there are some easy-to-handle special cases like splits and merges.

The main problem for either approach is WAL recovery. First, we would have to read nonces from
the entire WAL, not just the records we recover, because otherwise nonces won't work for records
that were already flushed (we don't recover the WAL below some watermark, since those records
are already in the store).
Second, even if we do read the entire WAL, the WAL can go away very quickly after all its records
make it to the FS, so we would have no nonces from it at all.
One option is to keep the WAL around for the nonce recovery period.
Alternatively, we can have a separate, additional "log" file for nonces. It would just contain a
bunch of numbers. Flushing it would not be on the main path - because the WAL itself also needs
to contain nonces (for replication at least), we can flush the nonce log only before a memstore
flush.
So when we recover, for a given nonce we will either see it in the WAL (if it was never flushed
to disk, it will be part of recovery), or we'll see it in the nonce log (or occasionally both,
which doesn't matter).
The nonce log can be rolled independently and nuked once a file is at least expiration-time old.
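
A rough sketch of what the nonce log could look like - just a flat file of (clientId, nonce)
pairs, written to a local file here purely for illustration (the real thing would write to the
FS alongside the WAL and be rolled/deleted by age):
{code:java}
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustrative only: a flat file of numbers, synced before memstore flush,
// rolled independently, and deleted once older than the nonce expiration.
public class NonceLogWriter implements AutoCloseable {
  private final DataOutputStream out;

  public NonceLogWriter(String path) throws IOException {
    out = new DataOutputStream(new FileOutputStream(path, true));
  }

  /** Appending is cheap and does not sync, so nothing extra lands on the main write path. */
  public void append(long clientId, long nonce) throws IOException {
    out.writeLong(clientId);
    out.writeLong(nonce);
  }

  /** Called right before a memstore flush, so flushed cells always have persisted nonces. */
  public void syncBeforeMemstoreFlush() throws IOException {
    out.flush();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}
{code}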

A radically different solution that [~enis] proposed is to write increments and appends as
separate markers (like delete markers), containing nonces as Cell tags or shadow columns, and to
coalesce them on reads and, after some time, during compactions.
When we coalesce them to get the final value, we throw away the duplicates.
This way we get rid of all the above complexity, because nonce management becomes just part of
normal KV management.
However, we may introduce a lot of other special cases around out-of-order puts/deletes and the
number of versions to keep (increments/appends would need special accounting to keep version
semantics).
Plus, coalescing the value of some often-incremented field may be expensive.
It would also allow us to support out-of-order increments and appends! Just kidding.
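
For illustration, the coalescing step under that scheme could boil down to something like this
(types here are stand-ins; the real thing would operate on Cells and tags during reads and
compactions):
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: fold increment markers into one value, skipping duplicate nonces.
public class IncrementCoalescer {
  public static final class IncrementMarker {
    final long delta;
    final long nonce;
    public IncrementMarker(long delta, long nonce) { this.delta = delta; this.nonce = nonce; }
  }

  public static long coalesce(long baseValue, List<IncrementMarker> markers) {
    long value = baseValue;
    Set<Long> seenNonces = new HashSet<>();
    for (IncrementMarker m : markers) {
      if (seenNonces.add(m.nonce)) {   // add() returns false for a nonce we already applied
        value += m.delta;
      }
    }
    return value;
  }
}
{code}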

h5. TL;DR2 My current plan is as follows. I will wait until Mon-Tue for feedback while doing
other things (or until enough feedback accumulates :)). Then I will stash my sequential-nonces
code and do the simplest hash-nonces patch possible, including sending a summary of nonces to
the server during WAL recovery from whatever WAL we are currently reading, including below the
watermark, without actually replaying the KVs. It will not be bulletproof, but it is a first
step, if there are no objections :)
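
The shape of that recovery pass, very roughly (WalEntry and the callbacks are stand-ins, not the
actual WAL reader API):
{code:java}
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongConsumer;

// Illustrative only: report every nonce we read, replay only edits above the watermark.
public final class RecoveryNoncePass {
  public interface WalEntry {
    long getNonce();
    long getSequenceId();
  }

  public static void run(List<WalEntry> entries, long flushedSequenceId,
                         LongConsumer reportNonce, Consumer<WalEntry> replay) {
    for (WalEntry entry : entries) {
      reportNonce.accept(entry.getNonce());           // nonces are collected unconditionally
      if (entry.getSequenceId() > flushedSequenceId) {
        replay.accept(entry);                         // below-watermark edits are not replayed
      }
    }
  }
}
{code}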

                
> Increment is non-idempotent but client retries RPC
> --------------------------------------------------
>
>                 Key: HBASE-3787
>                 URL: https://issues.apache.org/jira/browse/HBASE-3787
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.94.4, 0.95.2
>            Reporter: dhruba borthakur
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>             Fix For: 0.95.1
>
>         Attachments: HBASE-3787-partial.patch
>
>
> The HTable.increment() operation is non-idempotent. The client retries the increment
RPC a few times (as specified by configuration) before throwing an error to the application.
This makes it possible that the same increment call be applied twice at the server.
> For increment operations, is it better to use HConnectionManager.getRegionServerWithoutRetries()?
Another option would be to enhance the IPC module to make the RPC server correctly identify
if the RPC is a retry attempt and handle accordingly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
