flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-7757) RocksDB lock is too strict and can block snapshots in synchronous phase
Date Tue, 10 Oct 2017 10:24:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198477#comment-16198477 ]

ASF GitHub Bot commented on FLINK-7757:

Github user aljoscha commented on a diff in the pull request:

    --- Diff: flink-contrib/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBKeyedStateBackend.java
    @@ -618,19 +600,8 @@ public void releaseSnapshotResources() {
     				readOptions = null;
    -		}
    -		/**
    -		 * Drop the created snapshot if we have been cancelled.
    -		 */
    -		public void dropSnapshotResult() {
    -			if (null != snapshotResultStateHandle) {
    -				try {
    -					snapshotResultStateHandle.discardState();
    --- End diff --
    Cleanup is now handled somewhere else?

> RocksDB lock is too strict and can block snapshots in synchronous phase
> -----------------------------------------------------------------------
>                 Key: FLINK-7757
>                 URL: https://issues.apache.org/jira/browse/FLINK-7757
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.2, 1.3.2
>            Reporter: Stefan Richter
>            Assignee: Stefan Richter
>            Priority: Blocker
>             Fix For: 1.4.0
> {{RocksDBKeyedStateBackend}} uses a lock to guard the db instance against disposal of
> the native resources while parallel threads might still be accessing the db, which could
> otherwise lead to segfaults.
> Unfortunately, this locking is a bit too strict and can lead to situations where snapshots
> block the pipeline. This can happen when a snapshot s1 is running and blocked somewhere in
> IO while holding the guarding lock. A second snapshot s2 can be triggered in parallel and
> must hold the lock in its synchronous part to obtain a snapshot from the db. As s1 is still
> holding the lock, s2 blocks here and stops the operator from processing further elements.
> A simple solution could remove lock acquisition from the synchronous phase, because both
> the synchronous phase and disposal of the backend are only allowed to be triggered from the
> thread that also drives element processing.
> A better solution would be to remove long sections under the lock altogether, because as
> of now they will always prevent parallel checkpointing. I think a guard for the RocksDB
> instance that blocks disposal for as long as clients might still be accessing the instance
> in parallel would be sufficient. This could be realized by keeping a synchronized counter
> of active clients and blocking disposal until the client count drops to zero (a sketch
> follows below this quoted description).
> This approach could also be integrated with triggering timers, which have always been
> problematic in the disposal phase and are currently unregulated. In the new model, they
> could register as yet another client.
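
For illustration, here is a minimal sketch in Java of the client-counting guard described
above. All names (ResourceGuard, acquire, release, closeAndWait) are hypothetical and not
taken from the Flink code base; the sketch only demonstrates the proposed scheme: clients
register before touching the db, and disposal blocks until the count has dropped to zero.

    import java.io.IOException;

    // Hypothetical sketch of the proposed client-counting guard; not actual Flink API.
    final class ResourceGuard {

        private int activeClients;   // clients currently accessing the guarded resource
        private boolean closed;      // set once disposal has been requested

        // Registers a client; fails if disposal has already been requested.
        synchronized void acquire() throws IOException {
            if (closed) {
                throw new IOException("Resource already disposed.");
            }
            ++activeClients;
        }

        // Unregisters a client and wakes a pending disposal, if any.
        synchronized void release() {
            if (--activeClients == 0) {
                notifyAll();
            }
        }

        // Marks the guard as closed and blocks until no client is left.
        synchronized void closeAndWait() throws InterruptedException {
            closed = true;
            while (activeClients > 0) {
                wait();
            }
            // The caller may now safely dispose the native RocksDB instance.
        }
    }

Under this scheme, a snapshot would call acquire() before its synchronous phase and
release() in a finally block once its asynchronous IO completes, while backend disposal
would call closeAndWait() instead of holding a long-lived lock. Triggering timers could
register through the same acquire()/release() pair, becoming just another counted client.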
