cassandra-user mailing list archives

From "Kenneth Brotman" <kenbrot...@yahoo.com.INVALID>
Subject RE: Re: Re: A node down every day in a 6 nodes cluster
Date Wed, 28 Mar 2018 12:59:56 GMT
Properly Sizing Your Heap to Prevent OutOfMemoryErrors

https://support.datastax.com/hc/en-us/articles/204225929-Properly-Sizing-Your-Heap-to-Prevent-OutOfMemoryErrors
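
For comparison with the MAX_HEAP_SIZE=8192m and HEAP_NEWSIZE=512m settings reported at the start of this thread, here is a minimal Python sketch of the default sizing rule that the stock cassandra-env.sh applies when those variables are left unset, as best I recall it from the 3.x script: max heap is max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)), and the young generation is min(100 MB per core, 1/4 of the heap). The function name and the 16384 MB input are assumptions for illustration; the real script reads memory from the OS, so its numbers will differ slightly.

```python
def default_heap_sizes(system_memory_mb, cpu_cores):
    """Approximate the cassandra-env.sh defaults (an assumption, not the script itself).

    Returns (max_heap_mb, heap_newsize_mb).
    """
    # Half of RAM, capped at 1024 MB.
    half = min(system_memory_mb // 2, 1024)
    # A quarter of RAM, capped at 8192 MB.
    quarter = min(system_memory_mb // 4, 8192)
    # The larger of the two candidates becomes the max heap.
    max_heap_mb = max(half, quarter)
    # Young generation: 100 MB per core, but no more than a quarter of the heap.
    new_size_mb = min(100 * cpu_cores, max_heap_mb // 4)
    return max_heap_mb, new_size_mb


# Hypothetical input matching the "4C 16G" nodes described in the thread.
max_heap_mb, new_size_mb = default_heap_sizes(16384, 4)
print(f"MAX_HEAP_SIZE={max_heap_mb}M HEAP_NEWSIZE={new_size_mb}M")
```

Under these assumptions a 4-core, 16 GB node would default to roughly a 4 GB heap, so the 8 GB heap in this cluster is already about double the default. That fits the advice elsewhere in the thread to look at the data model and not only at heap size.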

 

 

From: Kenneth Brotman [mailto:kenbrotman@yahoo.com.INVALID] 
Sent: Wednesday, March 28, 2018 5:35 AM
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

If you think that will fix the problem, maybe you could add a little more memory to each machine
as a short-term fix.

 

From: Xiangfei Ni [mailto:xiangfei.ni@cm-dt.com] 
Sent: Wednesday, March 28, 2018 5:24 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

 

Yes, we discussed it and plan to figure out the data model issue and upgrade to version 3.11.3.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Kenneth Brotman <kenbrotman@yahoo.com.INVALID>
Sent: March 28, 2018 20:16
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

David, 

 

Did you figure out what to do about the data model problem?  It could be that your data files
finally grew to the point where the data model problem caused the Java heap space issue,
in which case everything is actually working as it’s supposed to; you just have to fix the
data model.

 

Kenneth Brotman

 

 

From: Kenneth Brotman [mailto:kenbrotman@yahoo.com]
Sent: Wednesday, March 28, 2018 4:46 AM
To: 'user@cassandra.apache.org'
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

Was any change to hardware done around the time the problem started?

Was any change to the client software done around the time the problem started?

Was any change to the database schema done around the time the problem started?

 

Kenneth Brotman

 

From: Xiangfei Ni [mailto:xiangfei.ni@cm-dt.com]
Sent: Wednesday, March 28, 2018 4:40 AM
To: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

 

Hi Kenneth,

    The cluster has been running for 4 months.

    The problem started last week.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Kenneth Brotman <kenbrotman@yahoo.com.INVALID>
Sent: March 28, 2018 19:34
To: user@cassandra.apache.org
Subject: RE: Re: Re: A node down every day in a 6 nodes cluster

 

David,

 

How long has the cluster been operating?

How long has the problem been occurring?

 

Kenneth Brotman

 

From: Jeff Jirsa [mailto:jjirsa@gmail.com]
Sent: Tuesday, March 27, 2018 7:00 PM
To: Xiangfei Ni
Cc: user@cassandra.apache.org
Subject: Re: Re: Re: A node down every day in a 6 nodes cluster

 

 

java.lang.OutOfMemoryError: Java heap space

 

 

You’re OOMing.

 

-- 

Jeff Jirsa

 


On Mar 27, 2018, at 6:45 PM, Xiangfei Ni <xiangfei.ni@cm-dt.com> wrote:

Hi Jeff,

    Today another node shut down. I have attached the exception log file; could you please
help analyze it? Thanks.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Jeff Jirsa <jjirsa@gmail.com>
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xiangfei.ni@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Re: A node down every day in a 6 nodes cluster

 

Only one node having the problem is suspicious. It may be that your application is improperly
pooling connections, or you have a hardware problem.

 

I don't see anything in nodetool that explains it, though you certainly have a data model likely
to cause problems over time (the cardinality of rt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have
very wide partitions and it'll be difficult to read).
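
One way to spot wide partitions like this is to scan saved nodetool cfstats output for large "Compacted partition maximum bytes" values. The Python sketch below assumes the 3.x output layout, with "Table:" or "Table (index):" lines followed by per-table statistics. The function names, the sample text, and the 100 MB threshold are all hypothetical choices for illustration, not values taken from this cluster.

```python
import re


def max_partition_bytes(cfstats_text):
    """Map each table (or index) name to its 'Compacted partition maximum bytes'."""
    sizes = {}
    table = None
    for raw in cfstats_text.splitlines():
        line = raw.strip()
        # Assumed header format: "Table: name" or "Table (index): name".
        header = re.match(r"Table(?: \(index\))?: (\S+)", line)
        if header:
            table = header.group(1)
            continue
        stat = re.match(r"Compacted partition maximum bytes: (\d+)", line)
        if stat and table is not None:
            sizes[table] = int(stat.group(1))
    return sizes


def wide_tables(cfstats_text, threshold_bytes=100 * 1024 * 1024):
    """Return the sorted names of tables whose largest partition exceeds the threshold."""
    sizes = max_partition_bytes(cfstats_text)
    return sorted(name for name, size in sizes.items() if size > threshold_bytes)


# Hypothetical sample, not the output attached to this thread.
SAMPLE = """
        Table: small_table
        Compacted partition maximum bytes: 1024
        Table (index): big_table.idx_example
        Compacted partition maximum bytes: 2147483648
"""

print(wide_tables(SAMPLE))
```

Any table this flags is only a candidate for a closer look; the threshold is a rule of thumb, and the right limit depends on the workload.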
 
 

 

On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xiangfei.ni@cm-dt.com> wrote:

Hi Jeff,

    I need to restart the node manually every time. Only one node has this problem.

    I have attached the nodetool output. Thanks.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Jeff Jirsa <jjirsa@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

 

That warning isn’t sufficient to understand why the node is going down.

 

 

Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.3 is likely a good
idea.

 

Are the nodes coming up on their own? Or are you restarting them?

 

Paste the output of nodetool tpstats and nodetool cfstats

 

 

 

-- 

Jeff Jirsa

 


On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xiangfei.ni@cm-dt.com> wrote:

Hi Cassandra experts,

  I am facing an issue: a node goes down every day in a 6-node cluster. The cluster is in a
single DC.

  Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m.
Each node holds about 200 GB of data, and the RF for the business CF is 3. A node goes down once
a day, and system.log shows the following:

WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101
- CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>

ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected
error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException:
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only
0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
~[guava-18.0.jar:na]

        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513)
[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407)
[apache-cassandra-3.9.jar:3.9]

        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
[netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366)
[netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35)
[netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357)
[netty-all-4.0.39.Final.jar:4.0.39.Final]

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]

        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException:
Operation timed out - received only 0 responses.

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]

        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]

        ... 26 common frames omitted

Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received
only 0 responses.

        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93)
~[apache-cassandra-3.9.jar:3.9]

        ... 32 common frames omitted

WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101
- CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>

ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected
error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException:
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only
0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

 

I have confirmed that nev_tsp_sa has all permissions on the nev_prod_tsp keyspace:

cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 

 role       | resource          | permissions
------------+-------------------+--------------------------------------------------------------
 nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

 

The cache disk can be read and written normally.

 

Any help would be highly appreciated. Thanks very much!

 

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

 

<log.txt>

