To: <user@cassandra.apache.org>
From: Sergio Bilello <lapostadisergio@gmail.com>
Date: Mon, 14 Oct 2019 21:34:49 -0000
Subject: Cassandra node join problem

Problem:
The Cassandra node does not work, even after a restart, and throws this exception:
WARN  [Thread-83069] 2019-10-11 16:13:23,713 CustomTThreadPoolServer.java:125 - Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Socket closed
at org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:109) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:36) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:60) ~[libthrift-0.9.2.jar:0.9.2]
at org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:113) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.run(ThriftServer.java:134) [apache-cassandra-3.11.4.jar:3.11.4]

The CPU load goes to 50 and the node becomes unresponsive.

Node configuration:
OS: Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

This is a working node, one of the first nodes in the cluster. It does not have the recommended settings, but it works:
cat /proc/23935/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             122422               122422               processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       122422               122422               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


I tried to bootstrap a new node to join the existing cluster.
Disk usage is around 400 GB of the 885 GB available (SSD).

On my first attempt, the node failed and was restarted over and over by systemctl, which does not
honor the limits configuration I specified. It threw:

Caused by: java.nio.file.FileSystemException: /mnt/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/md-52-big-Index.db: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[na:1.8.0_161]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[na:1.8.0_161]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[na:1.8.0_161]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[na:1.8.0_161]
at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[na:1.8.0_161]
at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[na:1.8.0_161]
at org.apache.cassandra.io.util.SequentialWriter.openChannel(SequentialWriter.java:104) ~[apache-cassandra-3.11.4.jar:3.11.4]
.. 20 common frames omitted
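
As a rough way to see how close a process is to "Too many open files", the number of open descriptors can be counted from /proc. This is only a minimal sketch, assuming Linux and permission to read the process's fd directory (same user or root); count_open_fds and its proc_root parameter are illustrative names, not part of Cassandra or of any tool mentioned here.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of file descriptors currently open in process `pid`.

    On Linux, /proc/<pid>/fd holds one entry per open descriptor.
    `proc_root` is a hypothetical parameter so the function can be pointed
    at a different directory tree, for example in a test.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    return len(os.listdir(fd_dir))


# Example usage (needs the real PID of the Cassandra JVM):
# print(count_open_fds(23935))
```

Comparing this count with the "Max open files" soft limit of the same process shows how much headroom is left.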

I fixed the above by stopping Cassandra, cleaning the commitlog, saved_caches, hints, and data directories, restarting it, getting the PID, and running the two commands below:
sudo prlimit -n1048576 -p <JAVA_PID>
sudo prlimit -u32768 -p <CASSANDRA_PID>
I did this because at the beginning the node didn't even join the cluster; it was reported as UJ.
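
To confirm that limits applied with prlimit actually reached the running process, the contents of /proc/<PID>/limits can be parsed and compared against target values. This is a minimal sketch, not a definitive check: parse_limits and below_target are illustrative names, and the targets are simply the values from the two prlimit commands above (-n for open files, -u for processes).

```python
import re


def _to_number(value):
    """Convert a limit value to a number; 'unlimited' becomes infinity."""
    return float("inf") if value == "unlimited" else int(value)


def parse_limits(text):
    """Parse the text of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "Limit":
            continue  # skip blank lines and the header row
        name, soft, hard = fields[:3]
        limits[name] = (_to_number(soft), _to_number(hard))
    return limits


def below_target(limits, targets):
    """Return {name: (soft, target)} for each soft limit below its target."""
    return {
        name: (limits[name][0], target)
        for name, target in targets.items()
        if name in limits and limits[name][0] < target
    }


# Targets taken from the prlimit commands in this message.
TARGETS = {"Max open files": 1048576, "Max processes": 32768}

# Example usage (needs the real PID of the Cassandra JVM):
# with open("/proc/23935/limits") as f:
#     print(below_target(parse_limits(f.read()), TARGETS))
```

Run against the limits of the working node shown earlier, this would flag only "Max open files" (65536, below 1048576).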

After fixing the max open files problem, the node went from Up/Joining (UJ) to Up/Normal (UN).
It joined the cluster, but after a while it started to throw:

WARN  [Thread-83069] 2019-10-11 16:13:23,713 CustomTThreadPoolServer.java:125 - Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Socket closed
at org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:109) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:36) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:60) ~[libthrift-0.9.2.jar:0.9.2]
at org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:113) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.run(ThriftServer.java:134) [apache-cassandra-3.11.4.jar:3.11.4]


I compared cassandra.yaml and limits.conf between the nodes, but that did not help. I don't know how the current nodes are working, since they don't have the recommended Cassandra limits.

Any suggestions on the possible culprit?

Please let me know 

Thanks
