Date: Thu, 9 Nov 2017 13:50:00 +0000 (UTC)
From: "Ricardo Bartolome (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-13999) Segfault during memtable flush

    [ https://issues.apache.org/jira/browse/CASSANDRA-13999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245673#comment-16245673 ]

Ricardo Bartolome commented on CASSANDRA-13999:
-----------------------------------------------

Hi [~beobal]. We just realised that the debug.log fragment we provided initially is wrong: it belongs to a different stack trace that we got in the meantime and that we believe is related. So I did the following:
* Deleted node_crashing_debug.log to avoid confusion
* Uploaded flush_exception_debug_fragment.log.obfuscated, which is what we get from our logging system (we no longer have the debug.log files; we'll keep them more carefully next time).
Regarding the other segfault we suffered, which we think is related and whose stack trace is very similar to CASSANDRA-12590:
* Uploaded cassandra-jvm-file-error-1509717499-pid10419.log.obfuscated
* Uploaded compaction_exception_debug_fragment.obfuscated.log, which is the debug.log fragment that you saw initially.

> Segfault during memtable flush
> ------------------------------
>
>                 Key: CASSANDRA-13999
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13999
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>         Environment: * Cassandra 3.9
> * Oracle JDK 1.8.0_112 and 1.8.0_131
> * Kernel 4.9.43-17.38.amzn1.x86_64 and 3.14.35-28.38.amzn1.x86_64
>            Reporter: Ricardo Bartolome
>            Priority: Critical
>         Attachments: cassandra-jvm-file-error-1509698372-pid16151.log.obfuscated, cassandra-jvm-file-error-1509717499-pid10419.log.obfuscated, cassandra_config.yaml, compaction_exception_debug_fragment.obfuscated.log, flush_exception_debug_fragment.obfuscated.log
>
>
> We are getting segfaults on a production Cassandra cluster, apparently caused by memtable flushes to disk.
> {code}
> Current thread (0x000000000cd77920): JavaThread "PerDiskMemtableFlushWriter_0:140" daemon [_thread_in_Java, id=28952, stack(0x00007f8b7aa53000,0x00007f8b7aa94000)]
> {code}
> Stack
> {code}
> Stack: [0x00007f8b7aa53000,0x00007f8b7aa94000], sp=0x00007f8b7aa924a0, free space=253k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> J 21889 C2 org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(Lorg/apache/cassandra/db/rows/UnfilteredRowIterator;)Lorg/apache/cassandra/db/RowIndexEntry; (361 bytes) @ 0x00007f8e9fcf75ac [0x00007f8e9fcf42c0+0x32ec]
> J 22464 C2 org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents()V (383 bytes) @ 0x00007f8e9f17b988 [0x00007f8e9f17b5c0+0x3c8]
> j org.apache.cassandra.db.Memtable$FlushRunnable.call()Lorg/apache/cassandra/io/sstable/SSTableMultiWriter;+1
> j org.apache.cassandra.db.Memtable$FlushRunnable.call()Ljava/lang/Object;+1
> J 18865 C2 java.util.concurrent.FutureTask.run()V (126 bytes) @ 0x00007f8e9d3c9540 [0x00007f8e9d3c93a0+0x1a0]
> J 21832 C2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @ 0x00007f8e9f16856c [0x00007f8e9f168400+0x16c]
> J 6720 C1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V (9 bytes) @ 0x00007f8e9def73c4 [0x00007f8e9def72c0+0x104]
> J 22079 C2 java.lang.Thread.run()V (17 bytes) @ 0x00007f8e9e67c4ac [0x00007f8e9e67c460+0x4c]
> v ~StubRoutines::call_stub
> V [libjvm.so+0x691d16] JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)+0x1056
> V [libjvm.so+0x692221] JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x321
> V [libjvm.so+0x6926c7] JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)+0x47
> V [libjvm.so+0x72da50] thread_entry(JavaThread*, Thread*)+0xa0
> V [libjvm.so+0xa76833] JavaThread::thread_main_inner()+0x103
> V [libjvm.so+0xa7697c] JavaThread::run()+0x11c
> V [libjvm.so+0x927568] java_start(Thread*)+0x108
> C [libpthread.so.0+0x7de5] start_thread+0xc5
> {code}
> For further details, we attached:
> * JVM error file with all details
> * cassandra config file (we are using offheap_buffers as memtable_allocation_method; see the snippet below)
> * some lines printed in debug.log when the JVM error file was created and the process died
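>
> For reference, a minimal sketch of the cassandra.yaml setting in question (an illustrative excerpt only, not the full attached config; the commented-out value is the workaround we still have to validate):
> {code}
> # Controls where memtable buffers are allocated.
> # Current production setting:
> memtable_allocation_method: offheap_buffers
> # Candidate workaround, still to be validated:
> # memtable_allocation_method: heap_buffers
> {code}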
> h5. Reproducing the issue
> So far we have been unable to reproduce it. It happens once or twice a week on single nodes, during both high-load and low-load periods. We have seen that when we replace EC2 instances and bootstrap new ones, due to the compactions happening on the source nodes before the stream starts, sometimes more than a single node was affected by this, leaving us with 2 out of 3 replicas down and UnavailableExceptions in the cluster.
> This issue might be related to CASSANDRA-12590 (Segfault reading secondary index), even though this is the write path. Can someone confirm whether both issues could be related?
> h5. Specifics of our scenario:
> * Cassandra 3.9 on Amazon Linux (prior to this we were running Cassandra 2.0.9, and there are no records of this happening then, although I was not working on Cassandra at the time)
> * 12 x i3.2xlarge EC2 instances (8 cores, 64 GB RAM)
> * a total of 176 keyspaces (there is a per-customer pattern)
> ** Some keyspaces have a single table, while others have 2 or 5 tables
> ** There is a table that uses standard secondary indexes ("emailindex" on the "user_info" table)
> * It happens on both Oracle JDK 1.8.0_112 and 1.8.0_131
> * It happens on both kernel 4.9.43-17.38.amzn1.x86_64 and 3.14.35-28.38.amzn1.x86_64
> h5. Possible workarounds/solutions that we have in mind (still to be validated)
> * switching to heap_buffers (in case offheap_buffers triggers the bug), although we still have to measure the performance degradation under that scenario
> * removing secondary indexes in favour of Materialized Views for this specific case, although we are also concerned that MVs may introduce new issues into our current Cassandra 3.9
> * upgrading to 3.11.1 is an option, but we are trying to keep it as a last resort, given that the cost of migrating is high and we have no guarantee that new bugs affecting node availability will not be introduced.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org