From dev-return-47312-archive-asf-public=cust-asf.ponee.io@ignite.apache.org Fri Aug 30 08:50:02 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id 60004180660 for ; Fri, 30 Aug 2019 10:50:02 +0200 (CEST) Received: (qmail 85433 invoked by uid 500); 30 Aug 2019 08:50:01 -0000 Mailing-List: contact dev-help@ignite.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@ignite.apache.org Delivered-To: mailing list dev@ignite.apache.org Received: (qmail 85421 invoked by uid 99); 30 Aug 2019 08:50:01 -0000 Received: from mailrelay1-us-west.apache.org (HELO mailrelay1-us-west.apache.org) (209.188.14.139) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 30 Aug 2019 08:50:01 +0000 Received: from jira-he-de.apache.org (static.172.67.40.188.clients.your-server.de [188.40.67.172]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id AD750E00A9 for ; Fri, 30 Aug 2019 08:50:00 +0000 (UTC) Received: from jira-he-de.apache.org (localhost.localdomain [127.0.0.1]) by jira-he-de.apache.org (ASF Mail Server at jira-he-de.apache.org) with ESMTP id 2A33F7807B0 for ; Fri, 30 Aug 2019 08:50:00 +0000 (UTC) Date: Fri, 30 Aug 2019 08:50:00 +0000 (UTC) From: "Dmitriy Govorukhin (Jira)" To: dev@ignite.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Created] (IGNITE-12127) WAL writer may close file IO with unflushed changes when MMAP is disabled MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Dmitriy Govorukhin created IGNITE-12127: ------------------------------------------- Summary: WAL writer may close file IO with unflushed changes when MMAP is disabled Key: IGNITE-12127 URL: https://issues.apache.org/jira/browse/IGNITE-12127 Project: Ignite Issue Type: Bug Reporter: Dmitriy Govorukhin Assignee: Dmitriy Govorukhin Most likely the issue manifests itself as the following critical error: {code} 2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Failed to write buffer.]] org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to write buffer. at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444) [ignite-core-2.5.7.jar!/:2.5.7] at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249) [ignite-core-2.5.7.jar!/:2.5.7] at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-2.5.7.jar!/:2.5.7] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201] Caused by: java.nio.channels.ClosedChannelException: null at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) ~[na:1.8.0_201] at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) ~[na:1.8.0_201] at org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48) ~[ignite-core-2.5.7.jar!/:2.5.7] at org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41) ~[ignite-core-2.5.7.jar!/:2.5.7] at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111) ~[ignite-core-2.5.7.jar!/:2.5.7] at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437) [ignite-core-2.5.7.jar!/:2.5.7] ... 3 common frames omitted {code} It appears that there following sequence is possible: * Thread A attempts to log a large record which does not fit segment, {{addRecord}} fails and the thread A starts segment rollover. I successfully runs {{flushOrWait(null)}} and gets de-scheduled before adding switch segment record * Thread B attempts to log another record, which fits exactly till the end of the current segment. The record is added to the buffer * Thread A resumes and fails to add the switch segment record. No flush is performed and the thread immediately proceeds for wal-writer close * WAL writer thread wakes up, sees that there is a CLOSE request, closes the file IO and immediately proceeds to write unflushed changes causing the exception. Unconditional flush after switch segment record write should fix the issue. Besides the bug itself, I suggest the following changes to the {{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}} in earlier versions): * There is an {{fsync(filePtr)}} call inside {{close()}}; however, {{fsync()}} checks the {{stop}} flag (which is set inside {{close}}) and returns immediately after {{flushOrWait()}} if the flag is set - this is very confusing. After all, the {{close()}} itself explicitly calls {{force}} after flush * There is an ignored IO exception in mmap mode - this should be propagated to the failure handler * In WAL writer, we check for file CLOSE and then attemp to write to (possibly) the same write handle - write should be always before close * In WAL writer, there are racy reads of current handle - it would be better if we read the current handle once and then operate on it during the whole loop iteration -- This message was sent by Atlassian Jira (v8.3.2#803003)