cassandra-user mailing list archives

From Subroto Barua <sbarua...@yahoo.com.INVALID>
Subject Re: unable to gossip with peers exception when internode encryption is set to any setting other than 'none'
Date Tue, 27 Aug 2019 02:01:20 GMT
It could be an issue with the keystore/truststore. You may want to run keytool -list to validate
the files and passwords; also run md5sum on the files from one node in west and one node in east.
Check SSL port 7001: from one node in west, run telnet <node in east> 7001 (or the custom port
if you are not using the default port).
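
The checks suggested above could be scripted roughly as follows. This is a minimal sketch, assuming bash and GNU coreutils (md5sum, timeout); same_checksum and port_open are hypothetical helper names, and the file names and hosts in the usage comments are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: function names, paths, and hosts are illustrative assumptions.

# Succeeds when two readable files have identical MD5 digests, e.g. the
# truststore copied from a west node and the one copied from an east node.
# Returns 2 when either file cannot be read.
same_checksum() {
    local a b
    [ -r "$1" ] && [ -r "$2" ] || return 2
    a=$(md5sum < "$1" | cut -d' ' -f1)
    b=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

# Succeeds when a TCP connection to host:port opens within 5 seconds.
# Uses bash's /dev/tcp redirection as a stand-in for telnet.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage (placeholders, not real paths or hosts):
#   keytool -list -keystore /path/to/keystore.jks    # prompts for the store password
#   same_checksum west-truststore.jks east-truststore.jks && echo "truststores match"
#   port_open <node in east> 7001 && echo "SSL storage port reachable"
```

A failed port_open only shows that the TCP connection could not be opened; it does not say whether the cause is a security group, a firewall, or the node not listening.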
On Monday, August 26, 2019, 05:46:19 PM PDT, Michael Carlise <mcarlise@salesforce.com.INVALID> wrote:

Subroto -
Both tools error out; openssl fails with errno 111 (connection refused), which made me check
the bound ports on the C* node with encryption enabled. Port 9042 is not open (determined with
netstat -ant). I then compared the logs from starting a node with and without encryption.
Without encryption, I get a bunch of lines like:
OutboundTcpConnection.java:561 - Handshaking version w/ IP
These appear after a line like:
Gossiper.java - Waiting for gossip to settle...
With encryption set to 'dc', I don't see any of those lines, presumably because the gossiper
is trying to start but doesn't.
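
The two observations above (no listener on a port, no handshake lines in the log) could be checked repeatably with something like the following. This is a minimal sketch: listening_on and count_handshakes are hypothetical helper names, the log path in the usage comment is an assumption that depends on the install, and the column positions assume Linux netstat -ant output.

```shell
# Reads `netstat -ant` output on stdin and succeeds when some socket is in
# LISTEN state on the given local port (column 4 = local address, 6 = state).
listening_on() {
    awk -v p=":$1" '$6 == "LISTEN" && $4 ~ p"$" { found = 1 } END { exit !found }'
}

# Prints how many outbound handshake lines a Cassandra log contains
# (0 when there are none or the file cannot be read).
count_handshakes() {
    local n
    n=$(grep -c 'Handshaking version' "$1" 2>/dev/null) || true
    echo "${n:-0}"
}

# Usage (log path is an assumption):
#   netstat -ant | listening_on 7001 && echo "SSL storage port is bound"
#   netstat -ant | listening_on 9042 || echo "native transport is not bound"
#   count_handshakes /var/log/cassandra/system.log
```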
On Mon, Aug 26, 2019 at 6:51 PM Subroto Barua <sbarua116@yahoo.com.invalid> wrote:

Michael,
Are you able to connect to any C* node via OpenSSL?
openssl s_client -connect <ip address>:9042
cqlsh <ip address> --ssl
Subroto
On Aug 26, 2019, at 2:47 PM, Marc Selwan <marc.selwan@datastax.com> wrote:


Which exact version of OpenJDK are you using? Is it possible you don't have JCE on those nodes?
(I believe more recent versions of Java 8 have this baked in, so that might not be it.)

Marc Selwan | DataStax | PM, Server Team | (925) 413-7079



On Mon, Aug 26, 2019 at 1:56 PM Michael Carlise <mcarlise@salesforce.com.invalid> wrote:


I originally opened this issue on stackoverflow (https://stackoverflow.com/questions/57516660/cassandra-node-to-node-encryption-throws-unable-to-gossip-with-peers-exception).  
However, I haven't gotten any responses in over a week.  I'm going to post it here and maybe
someone will have an idea on where I can look.

We currently run a multi-region Cassandra cluster in AWS. It runs in four regions with 12 nodes
per region, without node-to-node encryption (or client encryption). We are trying to enable
inter-datacenter node-to-node encryption. However, when we flip encryption on, we get an
exception that nodes are unable to gossip with any peers.

It is possible that we didn't build our JKS keystore/truststores correctly (more on how we
built these files below). However, intra-datacenter communication does not work either, even
though it should remain unencrypted with internode_encryption: dc. Additionally, cqlsh cannot
connect to the node, even though require_client_auth is left at its default of false.
ERROR [main] 2019-08-15 18:46:32,241 CassandraDaemon.java:749 - Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any peers
        at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1435) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:566) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:823) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:683) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:632) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:388) [apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:620) [apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:732) [apache-cassandra-3.11.4.jar:3.11.4]
INFO  [main] 2019-08-15 18:47:07,384 YamlConfigurationLoader.java:89 - Configuration location: file:/etc/cassandra/cassandra.yaml


Note that this error does not appear immediately: there is a delay between the node starting
up and this exception being thrown.

Information about our cassandra setup

cassandra version: 3.11.4
JDK version: openjdk-8.
Linux: Ubuntu 18.04 (bionic).

cassandra.yaml
endpoint_snitch: Ec2MultiRegionSnitch

server_encryption_options:
  internode_encryption: dc
  keystore: <omitted>
  keystore_password: <omitted>
  truststore: <omitted>
  truststore_password: <omitted>

client_encryption_options:
  enabled: false

cassandra-rackdc.properties
prefer_local=true

No obvious errors with SSL output

When starting cassandra with JVM_OPTS="$JVM_OPTS -Djavax.net.debug=ssl" added to cassandra-env.sh we
see SSL logs printed to stdout (Note: Subject and Issuer were omitted on purpose).
found key for : cassy-us-west-2
adding as trusted cert:
  Subject: ...
  Issuer:  ...
  Algorithm: RSA; Serial number: 0xdad28d843fc73325d4c1a75207d4e74
  Valid from Fri May 27 00:00:00 UTC 2016 until Tue May 26 23:59:59 UTC 2026

...

trigger seeding of SecureRandom
done seeding SecureRandom   

Compared against the Java SE SSL/TLS connection debugging documentation, this looks correct.
Note, however, that we see this series of messages (along with the RSA key signature output)
repeated several times in rapid succession. We never observe any messages about the truststore
being loaded; however, that might be something that occurs only on client initiation (?)

Additionally, we do see cassandra report that the Encrypted Messaging service has been started.
INFO  [main] 2019-08-15 18:45:31,022 MessagingService.java:704 - Starting Encrypted Messaging Service on SSL port 7001

Doesn't appear to be a cassandra.yaml configuration problem

We can bring the node back online by simply configuring internode_encryption: none. This
action seems to rule out a broadcast_address or rpc_address configuration problem.

How we built our keystore/truststores

We followed the basic template in the DataStax docs for preparing SSL certificates. One minor
difference was that our private keys and CSRs were generated using openssl, one per region (we
plan to share the key and signed cert across the nodes in each region). They were created using
a command template like:
openssl req -new -newkey rsa:2048 -out cassy-<region>.csr -keyout cassy-<region>.key -config cassy-<region>.conf -subj "..." -nodes -sha256

The generated CSR was then signed by an internal root CA. Because we generated our files using
openssl, we had to build our jks files by importing our certs into them.

Commands to generate truststore

We distribute this one file to all nodes.
keytool -importcert \
    -keystore generic-server-truststore.jks \
    -alias rootCa \
    -file rootCa.crt \
    -noprompt \
    -keypass omitted \
    -storepass omitted

Commands to generate keystore

This was done once per region. Essentially, we created a keystore with keytool, deleted the
generated key entry, and then imported our own key entry with keytool from a pkcs12 file.
keytool -genkeypair -keyalg RSA -alias cassy-${region} -keystore cassy-${region}.jks -storepass omitted -keypass omitted -validity 365 -keysize 2048 -dname "..."

keytool -delete -alias cassy-${region} -keystore cassy-${region}.jks -storepass omitted

openssl pkcs12 -export -in signed_certs/${region}.pem -inkey keys/cassandra.${region}.key -name cassy-${region} -out ${region}.p12

keytool -importkeystore -deststorepass omitted -destkeystore cassy-${region}.jks -srckeystore ${region}.p12 -srcstoretype PKCS12

keytool -importcert -keystore cassy-${region}.jks -alias rootCa -file ca.crt -noprompt -keypass omitted -storepass omitted

Looking back at this, I don't remember why we used keytool to generate a keypair/keystore,
then deleted and imported. I think it was because the keytool importkeystore command refused
to run if the keystore didn't already exist.
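
For comparison, here is a shorter sequence that skips the generate-then-delete step. It is a sketch under stated assumptions, not a verified fix: as far as I know, keytool -importkeystore creates the destination keystore when it does not exist, and I believe Cassandra 3.x opens the private key with keystore_password, so the sketch gives the key entry the same password as the store. build_keystore, run, and the DRY_RUN switch are hypothetical names; the file paths follow the commands above.

```shell
# Runs a command, or only prints it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Builds cassy-<region>.jks from the signed cert chain and the private key.
# $1 = region, $2 = password used for both the store and the key entry.
build_keystore() {
    local region=$1 pass=$2
    # Bundle key and cert chain into PKCS12, protected with the same password.
    run openssl pkcs12 -export \
        -in "signed_certs/${region}.pem" \
        -inkey "keys/cassandra.${region}.key" \
        -name "cassy-${region}" \
        -out "${region}.p12" \
        -passout "pass:${pass}" || return 1
    # Import into a JKS keystore, setting store and key passwords explicitly.
    run keytool -importkeystore \
        -srckeystore "${region}.p12" -srcstoretype PKCS12 \
        -srcstorepass "$pass" \
        -destkeystore "cassy-${region}.jks" -deststoretype JKS \
        -deststorepass "$pass" -destkeypass "$pass" || return 1
    # List the result; the alias should show up as a PrivateKeyEntry.
    run keytool -list -keystore "cassy-${region}.jks" -storepass "$pass"
}

# Usage: preview the commands without touching any files.
#   DRY_RUN=1 build_keystore us-west-2 'changeit'
```

Passing passwords on the command line mirrors the commands above; they are visible in the process list while the tools run.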

ca.crt and pem file

The ca.crt file contains the root certificate and the intermediate certificate that was
used to sign the CSR. The pem file contains the signed CSR returned to us, the intermediate
cert, and the root CA (in that order).

openssl verify ca.crt and pem
openssl verify -CAfile ca.crt us-west-2.pem
signed_certs/us-west-2.pem: OK
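
One caveat with this check: as far as I know, openssl verify only examines the first certificate in the file it is given, so the intermediate and root that follow the signed cert in the pem are not themselves checked. A small sketch for confirming that a PEM bundle holds the expected number of certificates (three, per the description above); count_certs is a hypothetical helper name.

```shell
# Prints the number of certificates in a PEM bundle
# (0 when there are none or the file cannot be read).
count_certs() {
    local n
    n=$(grep -c -- '-----BEGIN CERTIFICATE-----' "$1" 2>/dev/null) || true
    echo "${n:-0}"
}

# Usage (paths from the post):
#   count_certs signed_certs/us-west-2.pem    # 3 expected: signed cert, intermediate, root
#   keytool -list -v -keystore cassy-us-west-2.jks | grep 'Certificate chain length'
```

The second usage line relies on keytool -list -v reporting a chain length for private key entries; comparing it with the count from the pem shows whether the whole chain made it into the keystore.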

Command output after enabling encryption

nodetool status (output truncated)
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load        Tokens  Owns (effective)  Host ID  Rack
?N  52.44.11.221    ?           256     25.4%             null     1c
...
?N  52.204.232.195  ?           256     23.2%             null     1d
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load        Tokens  Owns (effective)  Host ID  Rack
?N  34.209.2.144    ?           256     26.5%             null     2c
UN  52.40.32.177    105.99 GiB  256     23.7%             null     2c
?N  34.210.109.203  ?           256     24.7%             null     2a
...

The one node shown as online (UN) is the node with encryption enabled.

cqlsh to localhost
cassy-node6:~$ cqlsh
Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})

cqlsh to remote node (the remote node is a node with encryption enabled)
cassy-node6:~$ cqlsh 10.0.2.7
Connection error: ('Unable to connect to any servers', {'10.0.2.7': error(111, "Tried connecting to [('10.0.2.7', 9042)]. Last error: Connection refused")})


  