ignite-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dmitriy Govorukhin (Jira)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-12127) WAL writer may close file IO with unflushed changes when MMAP is disabled
Date Thu, 05 Sep 2019 13:10:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923423#comment-16923423

Dmitriy Govorukhin commented on IGNITE-12127:

Cherry-picked to ignite-2.7.6 402c9450dafbb201708f66d8bdab0ade0b87bd4f

> WAL writer may close file IO with unflushed changes when MMAP is disabled
> -------------------------------------------------------------------------
>                 Key: IGNITE-12127
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12127
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Dmitriy Govorukhin
>            Assignee: Dmitriy Govorukhin
>            Priority: Critical
>             Fix For: 2.7.6
>          Time Spent: 20m
>  Remaining Estimate: 0h
> Most likely the issue manifests itself as the following critical error:
> {code}
> 2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : Critical
system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler,
failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException:
Failed to write buffer.]]
> org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to write
>     at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444)
>     at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-2.5.7.jar!/:2.5.7]
>     at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
> Caused by: java.nio.channels.ClosedChannelException: null
>     at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) ~[na:1.8.0_201]
>     at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) ~[na:1.8.0_201]
>     at org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48)
>     at org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41)
>     at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111)
>     at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437)
>     ... 3 common frames omitted
> {code}
> It appears that there following sequence is possible:
>  * Thread A attempts to log a large record which does not fit segment, {{addRecord}}
fails and the thread A starts segment rollover. I successfully runs {{flushOrWait(null)}}
and gets de-scheduled before adding switch segment record
>  * Thread B attempts to log another record, which fits exactly till the end of the current
segment. The record is added to the buffer
>  * Thread A resumes and fails to add the switch segment record. No flush is performed
and the thread immediately proceeds for wal-writer close
>  * WAL writer thread wakes up, sees that there is a CLOSE request, closes the file IO
and immediately proceeds to write unflushed changes causing the exception.
> Unconditional flush after switch segment record write should fix the issue.
> Besides the bug itself, I suggest the following changes to the {{FileWriteHandleImpl}}
({{FileWriteAheadLogManager}} in earlier versions):
>  * There is an {{fsync(filePtr)}} call inside {{close()}}; however, {{fsync()}} checks
the {{stop}} flag (which is set inside {{close}}) and returns immediately after {{flushOrWait()}}
if the flag is set - this is very confusing. After all, the {{close()}} itself explicitly
calls {{force}} after flush
>  * There is an ignored IO exception in mmap mode - this should be propagated to the failure
>  * In WAL writer, we check for file CLOSE and then attemp to write to (possibly) the
same write handle - write should be always before close
>  * In WAL writer, there are racy reads of current handle - it would be better if we read
the current handle once and then operate on it during the whole loop iteration

This message was sent by Atlassian Jira

View raw message