accumulo-notifications mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4229) BatchWriter Locator cache out-of-sync when shared with tserver
Date Thu, 05 May 2016 01:31:12 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271772#comment-15271772
] 

ASF GitHub Bot commented on ACCUMULO-4229:
------------------------------------------

GitHub user dhutchis opened a pull request:

    https://github.com/apache/accumulo/pull/97

    ACCUMULO-4229 BatchWriter Locator cache out-of-sync when shared with tserver

    This version targets master. Will commit shortly.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhutchis/accumulo ACCUMULO-4229

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/97.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #97
    
----
commit 618ec9251d868ce08a3c43c0edf36296d37d8091
Author: Dylan Hutchison <dhutchis@cs.washington.edu>
Date:   2016-04-23T08:02:01Z

    ACCUMULO-4229 Add breaking test to reproduce sync problem
    
    Add two tests in BatchWriterInTabletServerIT.
    The first passes and establishes a baseline.
    The second fails, proving the issue.
    
    I added an extra iterator and a utility class for the IT.
    These are in the test package, so they do not get packaged into normal builds.

commit ebfa4bba15d054f554f5040e7dc6c6b668b58a9b
Author: Dylan Hutchison <dhutchis@cs.washington.edu>
Date:   2016-04-23T08:33:24Z

    ACCUMULO-4229 SyncingTabletLocator ensures caching clients align with tserver
    
    The previously failing test case now passes.
    
    A boolean indicator variable in TabletLocator
    is set to false when the static TabletLocator cache is cleared.
    Client writers and readers should check the indicator
    before using a locally cached Locator.
    
    A SyncingTabletLocator automatically fetches the current TabletLocator
    when the cached TabletLocator becomes outdated (when isValid flips false).
    Clients now stay in sync with the static TabletLocator cache
    through the use of SyncingTabletLocator, which
    TimeoutTabletLocator now extends.
    
    Variables assumed invariant for SyncingTabletLocator are now final.
    This does not affect semantics.
    
    The new design allows for TabletServerBatchWriter to use its local Locator cache
    more aggressively than it used to.
    
    The invalidateCache() methods of SyncingTabletLocator defensively sync,
    guarding against goofy code that might use more than one TabletLocator.
    
    It may be possible to replace the static locators collection with a Guava class.
    This is not done here for simplicity.
    
    Closes apache/accumulo#96

----


> BatchWriter Locator cache out-of-sync when shared with tserver
> --------------------------------------------------------------
>
>                 Key: ACCUMULO-4229
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4229
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Dylan Hutchison
>            Assignee: Dylan Hutchison
>
> BatchWriters that run a long time have write rates that sometimes mysteriously decrease
after the table it is writing to goes through a major compaction or a split.  The decrease
can be as bad as reducing throughput to 0.
> This was first first mentioned in this [email thread|https://mail-archives.apache.org/mod_mbox/accumulo-user/201406.mbox/%3CCAMz+DuvmmHegOn9EJeHR9H_rRpP50L2QZ53BbdruVO0pirArQw@mail.gmail.com%3E]
for major compactions.  
> I discovered this in this [email thread|https://mail-archives.apache.org/mod_mbox/accumulo-dev/201604.mbox/%3CCAPx%3DJkaY7fVh-U0O%2Bysx2d98LOGMcA4oEQOYgoPxR-0em4hdvg%40mail.gmail.com%3E]
for splits.  See the thread for some log messages.
> I turned on TRACE logs and I think I pinned it down: the TabletLocator cached by a BatchWriter
gets out of sync with the static cache of TabletLocators.
> # The TabletServerBatchWriter caches a TabletLocator from the static collection of TabletLocators
when it starts writing.  Suppose it is writing to tablet T1.
> # The TabletServerBatchWriter uses its locally cached TabletLocator inside its `binMutations`
method for its entire lifespan; this cache is never refreshed or updated to sync up with the
static collection of TabletLocators.
> # Every hour, the static collection of TabletLocators clears itself.  The next call to
get a TabletLocator from the static collection allocates a new TabletLocator.  Unfortunately,
the TabletServerBatchWriter does not reflect this change and continues to use the old, locally
cached TabletLocator.
> # Tablet T1 splits into T2 and T3, which closes T1.  As such, it no longer exists and
the tablet server that receives the entries meant to go to T1 all fail to write because T1
is closed.
> # The TabletServerBatchWriter receives the response from the tablet server that all entries
failed to write.  It invalidates the cache of the *new* TabletLocator obtained from the static
collection of TabletLocators.  The old TabletLocator that is cached locally does not get invalidated.
> # The TabletServerBatchWriter re-queues the failed entries and tries to write them to
the same closed tablet T1, because it is still looking up tablets using the old TabletLocator.
> This behavior subsumes the circumstances William wrote about in the thread he mentioned.
 The problem would occur as a result of either splits or major compactions.  It would only
stop the BatchWriter if its entire memory filled up with writes to the same tablet that was
closed as a result of a majc or split; otherwise it would just slow down the BatchWriter by
failing to write some number of entries with every RPC.
> There are a few solutions we can think of.  
> # Not have the MutationWriter inside the TabletServerBatchWriter locally cache TabletLocators.
 I suspect this was done for performance reasons, so it's probably not a good solution. 
> # Have all the MutationWriters clear their cache at the same time the static TabletLocator
cache clears.  We could store a reference to the Map that each MutationWriter has inside a
static synchronized WeakHashMap.  The only time the weak map needs to be accessed is:
> ## When a MutationWriter is constructed (from constructing a TabletServerBatchWriter),
add its new local TabletLocator cache to the weak map.
> ## When the static TabletLocator cache is cleared, also clear every map in the weak map.
> # Another solution is to make the invalidate calls on the local TabletLocator cache rather
than the global static one.  If we go this route we should double check the idea to make sure
it does not impact the correctness of any other pieces of code that use the cache. 
> # Perhaps the simplest solution is to put an extra Boolean variable inside the Locators
indicating whether they are valid.  When they are cleared, their Boolean variables set to
false.  Before a client uses a locator from cache, it checks its Boolean indicator.
> The TimeoutTabletLocator does not help when no timeout is set on the BatchWriter (the
default behavior).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message