cassandra-user mailing list archives

From "Kenneth Brotman" <kenbrot...@yahoo.com.INVALID>
Subject RE: Reply: A node down every day in a 6 nodes cluster
Date Tue, 27 Mar 2018 11:45:06 GMT
David,

 

Can you replace the misbehaving node to see if that resolves the problem?

 

Kenneth Brotman

 

From: Xiangfei Ni [mailto:xiangfei.ni@cm-dt.com] 
Sent: Tuesday, March 27, 2018 3:27 AM
To: Jeff Jirsa
Cc: user@cassandra.apache.org
Subject: Reply: Reply: A node down every day in a 6 nodes cluster

 

Thanks Jeff,

So your suggestion is to first fix the data model issue that causes the wide partitions, right?

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516

 

From: Jeff Jirsa <jjirsa@gmail.com>
Sent: March 27, 2018 11:50
To: Xiangfei Ni <xiangfei.ni@cm-dt.com>
Cc: user@cassandra.apache.org
Subject: Re: Reply: A node down every day in a 6 nodes cluster

 

Only one node having the problem is suspicious. It may be that your application is pooling
connections improperly, or that you have a hardware problem.

 

I don't see anything in the nodetool output that explains it, though you certainly have a data
model that is likely to cause problems over time (the cardinality of
rt_ac_stat.idx_rt_ac_stat_prot_ver is such that you have very wide partitions, and it'll be
difficult to read).
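
To illustrate the cardinality point, here is a minimal Python sketch, not taken from the thread. It assumes that the indexed column holds only a handful of distinct values; the column name `prot_ver`, the `device_id` field, and the row counts are hypothetical. It counts how many rows fall under each key value, which is roughly how wide each partition would become.

```python
from collections import Counter


def partition_widths(rows, key):
    """Count how many rows land under each value of the chosen key."""
    return Counter(key(row) for row in rows)


# Hypothetical data: 100,000 rows but only 3 distinct protocol versions.
rows = [{"device_id": i, "prot_ver": f"v{i % 3}"} for i in range(100_000)]

# Keyed by a low-cardinality value: a few very wide partitions.
by_version = partition_widths(rows, lambda r: r["prot_ver"])
# Keyed by a high-cardinality value: many narrow partitions.
by_device = partition_widths(rows, lambda r: r["device_id"])

print(len(by_version), max(by_version.values()))
print(len(by_device), max(by_device.values()))
```

With few distinct values, every new row widens one of the same few partitions, so reads against them get slower as the data grows.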
 
 

 

On Mon, Mar 26, 2018 at 8:26 PM, Xiangfei Ni <xiangfei.ni@cm-dt.com> wrote:

Hi Jeff,

    I need to restart the node manually every time. Only one node has this problem.

    I have attached the nodetool output. Thanks.

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516


 

From: Jeff Jirsa <jjirsa@gmail.com>
Sent: March 27, 2018 11:03
To: user@cassandra.apache.org
Subject: Re: A node down every day in a 6 nodes cluster

 

That warning isn’t sufficient to understand why the node is going down

 

 

Cassandra 3.9 has some pretty serious known issues - upgrading to 3.11.2 is likely a good
idea.

 

Are the nodes coming up on their own? Or are you restarting them?

 

Paste the output of nodetool tpstats and nodetool cfstats

 

 

 

-- 

Jeff Jirsa

 


On Mar 26, 2018, at 7:56 PM, Xiangfei Ni <xiangfei.ni@cm-dt.com> wrote:

Hi Cassandra experts,

  I am facing an issue: one node goes down every day in a 6-node cluster. The cluster is in a
single DC.

  Every node has 4 cores and 16 GB of RAM, and the heap configuration is MAX_HEAP_SIZE=8192m
HEAP_NEWSIZE=512m. Each node holds about 200 GB of data, and the RF for the business CF is 3.
A node goes down once every day, and system.log shows the following:

WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101
- CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>

ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected
error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException:
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only
0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
~[guava-18.0.jar:na]

        at org.apache.cassandra.auth.AuthCache.get(AuthCache.java:108) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:45)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.authorize(ClientState.java:419) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:352)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:329)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:316) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:300)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:211)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:185)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:219) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:204) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:115)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:513)
[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:407)
[apache-cassandra-3.9.jar:3.9]

        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
[netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366)
[netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35)
[netty-all-4.0.39.Final.jar:4.0.39.Final]

        at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357)
[netty-all-4.0.39.Final.jar:4.0.39.Final]

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]

        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.9.jar:3.9]

        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]

Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException:
Operation timed out - received only 0 responses.

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.PermissionsCache.lambda$new$0(PermissionsCache.java:37)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.AuthCache$1.load(AuthCache.java:183) ~[apache-cassandra-3.9.jar:3.9]

        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
~[guava-18.0.jar:na]

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[guava-18.0.jar:na]

        ... 26 common frames omitted

Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received
only 0 responses.

        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1718)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1667) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1608) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1527) ~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:975)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:271)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:232)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.addPermissionsForRole(CassandraAuthorizer.java:227)
~[apache-cassandra-3.9.jar:3.9]

        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:93)
~[apache-cassandra-3.9.jar:3.9]

        ... 32 common frames omitted

WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101
- CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>

ERROR [Native-Transport-Requests-64] 2018-03-26 18:53:17,135 QueryMessage.java:128 - Unexpected
error during query

com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException:
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only
0 responses.

        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
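
When a log contains many traces like the one above, it can help to tally them before reading each in full. The following is a rough Python sketch under stated assumptions: the entry pattern is inferred only from the lines quoted in this message, the helper name `summarize` is made up for illustration, and wrapped lines are assumed to have been rejoined first. It counts entries by level and source file, and records the innermost "Caused by" exception of each trace.

```python
import re
from collections import Counter

# Start of a log entry as seen in the excerpt: LEVEL [thread] timestamp File.java:line
ENTRY = re.compile(
    r"^(TRACE|DEBUG|INFO|WARN|ERROR)\s+\[([^\]]+)\]\s+"
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\S+?):(\d+)"
)
CAUSE = re.compile(r"^Caused by: ([\w.$]+)")


def summarize(lines):
    """Return (entries per (level, source file), innermost cause per trace)."""
    levels, root_causes = Counter(), Counter()
    last_cause = None
    for line in lines:
        entry = ENTRY.match(line)
        if entry:
            # A new entry starts: close out the previous trace, if any.
            if last_cause:
                root_causes[last_cause] += 1
            last_cause = None
            levels[(entry.group(1), entry.group(4))] += 1
            continue
        cause = CAUSE.match(line.strip())
        if cause:
            # Later "Caused by" lines are deeper in the chain; keep the last one.
            last_cause = cause.group(1)
    if last_cause:
        root_causes[last_cause] += 1
    return levels, root_causes


SAMPLE = """\
WARN  [Native-Transport-Requests-19] 2018-03-26 18:53:17,128 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.latest_rt_alarm>
ERROR [Native-Transport-Requests-19] 2018-03-26 18:53:17,129 QueryMessage.java:128 - Unexpected error during query
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[guava-18.0.jar:na]
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:102) ~[apache-cassandra-3.9.jar:3.9]
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.9.jar:3.9]
WARN  [Native-Transport-Requests-23] 2018-03-26 18:53:17,131 CassandraAuthorizer.java:101 - CassandraAuthorizer failed to authorize #<User nev_tsp_sa> for <table nev_prod_tsp.rt_alarm_unite>
"""

levels, root_causes = summarize(SAMPLE.splitlines())
print(levels)
print(root_causes)
```

In the excerpt, the innermost cause is the ReadTimeoutException raised while the authorizer reads permissions, which suggests the authorization warnings are a symptom of a read timing out rather than of missing grants.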

 

I have confirmed that nev_tsp_sa has all rights on nev_prod_tsp keyspace:

cassandra@cqlsh:system_auth> select * from role_permissions where role = 'nev_tsp_sa';

 role       | resource          | permissions
------------+-------------------+--------------------------------------------------------------
 nev_tsp_sa | data/nev_prod_tsp | {'ALTER', 'AUTHORIZE', 'CREATE', 'DROP', 'MODIFY', 'SELECT'}

 

The cache disk can be read from and written to normally.

 

I would highly appreciate any help. Thanks very much!
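
As a point of comparison for the heap settings quoted in this message, here is a minimal Python sketch of the default sizing rule. It assumes the `calculate_heap_sizes` logic from the stock `cassandra-env.sh`: heap = max(min(RAM/2, 1 GB), min(RAM/4, 8 GB)), and new generation = min(100 MB per core, heap/4). The function name `default_heap_sizes` is made up for illustration, and 16384 MB is an assumed reading of "16G".

```python
def default_heap_sizes(system_memory_mb, cpu_cores):
    """Approximate the defaults cassandra-env.sh computes when no heap is set."""
    half = min(system_memory_mb // 2, 1024)      # RAM/2, capped at 1 GB
    quarter = min(system_memory_mb // 4, 8192)   # RAM/4, capped at 8 GB
    max_heap_mb = max(half, quarter)
    # Young generation: 100 MB per core, but no more than a quarter of the heap.
    new_size_mb = min(100 * cpu_cores, max_heap_mb // 4)
    return max_heap_mb, new_size_mb


# Assumed hardware from the message: 4 cores, 16 GB (taken as 16384 MB).
max_heap_mb, new_size_mb = default_heap_sizes(16384, 4)
print(f"default MAX_HEAP_SIZE={max_heap_mb}m HEAP_NEWSIZE={new_size_mb}m")
print("configured MAX_HEAP_SIZE=8192m HEAP_NEWSIZE=512m")
```

This only shows what the script would pick when the two variables are left unset. It does not by itself indicate that 8192m/512m is the cause of the outages.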

 

 

Best Regards, 

 

倪项菲/ David Ni

中移德电网络科技有限公司

Virtue Intelligent Network Ltd, co.

Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei

Mob: +86 13797007811|Tel: + 86 27 5024 2516


 

 

