ignite-dev mailing list archives

From "Dmitriy Govorukhin (Jira)" <j...@apache.org>
Subject [jira] [Created] (IGNITE-12127) WAL writer may close file IO with unflushed changes when MMAP is disabled
Date Fri, 30 Aug 2019 08:50:00 GMT
Dmitriy Govorukhin created IGNITE-12127:

             Summary: WAL writer may close file IO with unflushed changes when MMAP is disabled
                 Key: IGNITE-12127
                 URL: https://issues.apache.org/jira/browse/IGNITE-12127
             Project: Ignite
          Issue Type: Bug
            Reporter: Dmitriy Govorukhin
            Assignee: Dmitriy Govorukhin

Most likely the issue manifests itself as the following critical error:
2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : Critical system
error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler,
failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException:
Failed to write buffer.]]
org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to write
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-2.5.7.jar!/:2.5.7]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
Caused by: java.nio.channels.ClosedChannelException: null
    at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) ~[na:1.8.0_201]
    at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) ~[na:1.8.0_201]
    at org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48)
    at org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41)
    at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437)
    ... 3 common frames omitted

It appears that the following sequence is possible:
 * Thread A attempts to log a large record which does not fit the segment; {{addRecord}} fails
and thread A starts a segment rollover. It successfully runs {{flushOrWait(null)}} and gets
de-scheduled before adding the switch segment record
 * Thread B attempts to log another record, which exactly fills the remainder of the current
segment. The record is added to the buffer
 * Thread A resumes and fails to add the switch segment record. No flush is performed and
the thread immediately proceeds to close the WAL writer
 * The WAL writer thread wakes up, sees that there is a CLOSE request, closes the file IO and
only then attempts to write the unflushed changes, causing the exception.
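The resulting failure mode is the one in the stack trace above: a write issued against a channel that has already been closed. A minimal standalone reproduction of that symptom (not of the Ignite race itself) looks like this:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ClosedChannelDemo {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("wal-segment", ".bin");
        FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE);

        // The WAL writer observes the CLOSE request and closes the file IO...
        ch.close();

        // ...and then tries to write the unflushed tail of the buffer.
        try {
            ch.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));
        } catch (ClosedChannelException e) {
            System.out.println("ClosedChannelException: write after close");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```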

An unconditional flush after the switch segment record write should fix the issue.
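The racy schedule and the proposed fix can be modeled with a toy segment buffer. All names here are illustrative, not Ignite's actual {{FileWriteHandle}} code:

```java
import java.util.ArrayList;
import java.util.List;

public class RolloverFlushSketch {
    static final int CAPACITY = 8;            // toy segment size, in "bytes"
    final List<Integer> buffer = new ArrayList<>(); // unflushed records
    final List<Integer> disk = new ArrayList<>();   // records that reached the file

    /** Returns false when the record does not fit the remaining segment space. */
    boolean addRecord(int size) {
        int used = 0;
        for (int r : disk) used += r;
        for (int r : buffer) used += r;
        if (used + size > CAPACITY)
            return false;
        buffer.add(size);
        return true;
    }

    void flush() { disk.addAll(buffer); buffer.clear(); }

    public static void main(String[] args) {
        RolloverFlushSketch h = new RolloverFlushSketch();
        h.addRecord(5);                        // normal traffic
        h.flush();                             // thread A: flushOrWait(null)
        h.addRecord(3);                        // thread B: exactly fills the segment
        boolean switchAdded = h.addRecord(1);  // thread A: switch record does not fit

        // Proposed fix: flush unconditionally, even when the switch segment
        // record could not be added, so thread B's record is not lost on close.
        if (!switchAdded)
            h.flush();

        System.out.println("unflushed=" + h.buffer.size()); // 0: nothing lost
    }
}
```

Without the unconditional flush, the record added by thread B stays in the buffer when the handle is closed, which is exactly the lost-write window described above.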

Besides the bug itself, I suggest the following changes to {{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}}
in earlier versions):
 * There is an {{fsync(filePtr)}} call inside {{close()}}; however, {{fsync()}} checks the
{{stop}} flag (which is set inside {{close}}) and returns immediately after {{flushOrWait()}}
if the flag is set, which is very confusing. After all, {{close()}} itself explicitly
calls {{force}} after the flush
 * There is an ignored IO exception in mmap mode; it should be propagated to the failure
handler
 * In the WAL writer, we check for a file CLOSE request and then attempt to write to (possibly)
the same write handle; the write should always happen before the close
 * In the WAL writer, there are racy reads of the current handle; it would be better to read
the current handle once and operate on that copy for the whole loop iteration
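The last point, snapshotting the current handle once per iteration, can be sketched as follows; the names are illustrative, not Ignite's:

```java
import java.util.concurrent.atomic.AtomicReference;

public class HandleSnapshotDemo {
    static final AtomicReference<String> currentHandle =
        new AtomicReference<>("segment-0");

    public static void main(String[] args) {
        // Racy pattern: two reads of the shared reference can straddle a
        // concurrent rollover and observe different handles.
        String first = currentHandle.get();
        currentHandle.set("segment-1");      // simulated concurrent rollover
        String second = currentHandle.get();
        System.out.println("racy reads agree: " + first.equals(second));

        // Suggested pattern: one read per loop iteration; every subsequent
        // check and write in that iteration uses the same local snapshot.
        String hnd = currentHandle.get();
        currentHandle.set("segment-2");      // rollover cannot split 'hnd'
        System.out.println("snapshot: " + hnd);
    }
}
```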
