cassandra-user mailing list archives

From Michael Carlise <mcarl...@salesforce.com.INVALID>
Subject Re: unable to gossip with peers exception when internode encryption is set to any setting other than 'none'
Date Wed, 28 Aug 2019 18:49:04 GMT
telnet from node 1 -> node2 7001 (and 7000) works.

However, I can't rule out a JKS keystore/truststore issue.  I have tried a
number of configurations and none of them have helped (or emitted any
further error logging).  We have a root and an intermediate CA cert, and a
private key + signed CSR.  Our keystore has a single PrivateKeyEntry with a
chain of length 2, consisting of the signed CSR and the intermediate cert (in
that order).  The truststore has a single entry of length one, consisting of
the root cert used to issue the intermediate.  Does anybody know if that is
the correct setup for JKS?  This setup was given to us by another team in our
company that uses Java much more than we do.
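
For anyone who wants to check the same layout, here is a minimal sketch (not
something taken from our tooling) that condenses `keytool -list -v` output
into one line per alias, so the entry type and chain length described above
are easy to confirm.  The function name summarize_keystore and the STOREPASS
variable are made up for illustration, and the parsing assumes keytool's
English field labels.

```shell
# Summarize `keytool -list -v` output read from stdin: one line per alias,
# showing its entry type and, for private-key entries, the chain length.
# Assumption: keytool prints the English labels "Alias name:",
# "Entry type:" and "Certificate chain length:". Aliases containing
# spaces are truncated to their first word.
summarize_keystore() {
  awk '
    function emit() {
      if (alias != "") printf "%s type=%s chain=%s\n", alias, type, chain
    }
    /^Alias name: /               { emit(); alias = $3; type = "?"; chain = "-" }
    /^Entry type: /               { type = $3 }
    /^Certificate chain length: / { chain = $4 }
    END                           { emit() }
  '
}

# Hypothetical usage (STOREPASS is a placeholder for the real password):
#   keytool -list -v -keystore cassy-us-west-2.jks -storepass "$STOREPASS" | summarize_keystore
#   keytool -list -v -keystore generic-server-truststore.jks -storepass "$STOREPASS" | summarize_keystore
```

If the keystore really holds what I described, I would expect the first
command to print a single PrivateKeyEntry line with chain=2, and the second to
print a single trustedCertEntry line for the root cert.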

Some other points to note: CASSANDRA-9386 points out that 'dc'
internode_encryption doesn't work correctly when using Ec2MultiRegionSnitch
(it always uses encrypted connections).  But I still can't get 'all' to work.
I'm testing this by simply flipping encryption on for two non-seed nodes in
the same datacenter.  In system.log I can see both of them output the message
'Handshaking with /private IP', but a few minutes later the unable to gossip
exception is thrown.  No other information/logs are given, so I assume the
handshake failed, presumably because of an incorrect truststore/keystore?

I can't seem to find any concrete information about how to set up the
keystore cert chain and/or the truststore. Does anybody know of any good
sources on this topic, or know off the top of their head how this setup is
supposed to look?


On Mon, Aug 26, 2019 at 10:01 PM Subroto Barua <sbarua116@yahoo.com.invalid>
wrote:

> could be an issue with keystore/truststore --- you may want to do keytool
> -list -- validate the files/password; also do md5sum on files from 1 node
> in west and 1 node in east.
> check ssl port 7001 --- from 1 node in west --> telnet <node in east> 7001
> (or custom port if you are not using default port)
>
> On Monday, August 26, 2019, 05:46:19 PM PDT, Michael Carlise
> <mcarlise@salesforce.com.INVALID> wrote:
>
>
> Subroto -
>
> both tools error; openssl errno 111 - which made me check bound ports on
> the c* node with encryption flipped.  Port 9042 is not open (determined by
> netstat -ant).  Looking at the log differences for when a node is started
> with/without encryption.  Without encryption, I get a bunch of lines like:
>
> OutboundTcpConnection.java:561 - Handshaking version w/ IP
>
> And this happens after a line like
>
> Gossiper.java - Waiting for gossip to settle...
>
> with encryption toggled to 'dc', I don't see any of those lines,
> presumably b/c the gossiper is trying to start but doesn't.
>
> On Mon, Aug 26, 2019 at 6:51 PM Subroto Barua <sbarua116@yahoo.com.invalid>
> wrote:
>
> Michael,
>
> Are you able to connect to any c* node via OpenSSL?
>
> openssl s_client -connect <ip address>:9042
>
> cqlsh <ip address> --ssl
>
> Subroto
>
> On Aug 26, 2019, at 2:47 PM, Marc Selwan <marc.selwan@datastax.com> wrote:
>
> Which exact version of OpenJDK are you using? Is it possible you don't
> have JCE on those nodes? (I believe more recent versions of Java 8 have this
> baked in, so that might not be it.)
>
>
>
>
>
> On Mon, Aug 26, 2019 at 1:56 PM Michael Carlise <
> mcarlise@salesforce.com.invalid> wrote:
>
>
> I originally opened this issue on Stack Overflow (
> https://stackoverflow.com/questions/57516660/cassandra-node-to-node-encryption-throws-unable-to-gossip-with-peers-exception
> ).
>
> However, I haven't gotten any responses in over a week.  I'm going to post
> it here and maybe someone will have an idea on where I can look.
>
> We currently run a multi region cassandra cluster in AWS. It runs in four
> regions, 12 nodes per region. It runs without node to node encryption (or
> client encryption either). We are trying to enable inter datacenter node to
> node encryption. However, when we flip encryption over we get an exception
> that nodes are unable to gossip with any peers.
>
> It could be that we didn't build our JKS keystore/truststores
> correctly (more on how we built these files below). But we also do
> not see intra-datacenter communication working (which should be
> unencrypted with this setting). Additionally, cqlsh cannot connect to the
> node, even though we have (by default) require_client_auth set to false.
>
> ERROR [main] 2019-08-15 18:46:32,241 CassandraDaemon.java:749 - Exception encountered during startup
> java.lang.RuntimeException: Unable to gossip with any peers
>         at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1435) ~[apache-cassandra-3.11.4.jar:3.11.4]
>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:566) ~[apache-cassandra-3.11.4.jar:3.11.4]
>         at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:823) ~[apache-cassandra-3.11.4.jar:3.11.4]
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:683) ~[apache-cassandra-3.11.4.jar:3.11.4]
>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:632) ~[apache-cassandra-3.11.4.jar:3.11.4]
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:388) [apache-cassandra-3.11.4.jar:3.11.4]
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:620) [apache-cassandra-3.11.4.jar:3.11.4]
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:732) [apache-cassandra-3.11.4.jar:3.11.4]
> INFO  [main] 2019-08-15 18:47:07,384 YamlConfigurationLoader.java:89 - Configuration location: file:/etc/cassandra/cassandra.yaml
>
>
> Something to note is that this error message occurs a few minutes after
> the node comes up (i.e., there is a delay between startup and the point
> where this exception is thrown).
>
> *Information about our cassandra setup*
>
> cassandra version: 3.11.4
> JDK version: openjdk-8.
> Linux: Ubuntu 18.04 (bionic).
>
> *cassandra.yaml*
>
> endpoint_snitch: Ec2MultiRegionSnitch
>
> server_encryption_options:
>   internode_encryption: dc
>   keystore: <omitted>
>   keystore_password: <omitted>
>   truststore: <omitted>
>   truststore_password: <omitted>
>
> client_encryption_options:
>   enabled: false
>
> *cassandra-rackdc.properties*
>
> prefer_local=true
>
> *No obvious errors with SSL output*
>
> When starting cassandra with JVM_OPTS="$JVM_OPTS -Djavax.net.debug=ssl" added
> to cassandra-env.sh we see SSL logs printed to stdout (*Note: Subject and
> Issuer were omitted on purpose)*.
>
> found key for : cassy-us-west-2
> adding as trusted cert:
>   Subject: ...
>   Issuer:  ...
>   Algorithm: RSA; Serial number: 0xdad28d843fc73325d4c1a75207d4e74
>   Valid from Fri May 27 00:00:00 UTC 2016 until Tue May 26 23:59:59 UTC 2026
>
> ...
>
> trigger seeding of SecureRandom
> done seeding SecureRandom
>
> Looking at Java SE SSL/TLS connection debugging
> <https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html>,
> this looks correct. But to note, we see this series of messages (along with
> the RSA key signature output) repeated several times in rapid fire. We
> never observe any messages about the trust store being added; however that
> might be something that occurs only on client initiation (?)
>
> Additionally, we do see cassandra report that the Encrypted Messaging
> service has been started.
>
> INFO  [main] 2019-08-15 18:45:31,022 MessagingService.java:704 - Starting Encrypted Messaging Service on SSL port 7001
>
> *Doesn't appear to be a cassandra.yaml configuration problem*
>
> We can bring the node back online by simply configuring internode_encryption:
> none. This action seems to rule out a broadcast_address or rpc_address
> configuration problem.
>
> *How we built our keystore/truststores*
>
> We followed the basic template in the DataStax docs for preparing SSL
> certificates
> <https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/configuration/secureSSLCertWithCA.html>.
> One minor difference was that our private keys and CSRs were generated using
> openssl, one per region (we plan to share key/signed certs across
> nodes within a region). These were created using a command template like:
>
> openssl req -new -newkey rsa:2048 -out cassy-<region>.csr -keyout cassy-<region>.key -config cassy-<region>.conf -subj "..." -nodes -sha256
>
> The generated CSR was then signed by an internal root CA. Because we
> generated our files using openssl, we had to build our jks files by
> importing our certs into them.
>
> *Commands to generate truststore*
>
> We distribute this one file to all nodes.
>
> keytool -importcert
>     -keystore generic-server-truststore.jks
>     -alias rootCa
>     -file rootCa.crt
>     -noprompt
>     -keypass omitted
>     -storepass omitted
>
> *Commands to generate keystore*
>
> This was done one per region; but essentially we created a keystore with
> keytool, then deleted the key entry and then imported our key entry using
> keytool from a pkcs12 file.
>
> keytool -genkeypair -keyalg RSA -alias cassy-${region} -keystore cassy-${region}.jks -storepass omitted -keypass omitted -validity 365 -keysize 2048 -dname "..."
>
> keytool -delete -alias cassy-${region} -keystore cassy-${region}.jks -storepass omitted
>
> openssl pkcs12 -export -in signed_certs/${region}.pem -inkey keys/cassandra.${region}.key -name cassy-${region} -out ${region}.p12
>
> keytool -importkeystore -deststorepass omitted -destkeystore cassy-${region}.jks -srckeystore ${region}.p12 -srcstoretype PKCS12
>
> keytool -importcert -keystore cassy-${region}.jks -alias rootCa -file ca.crt -noprompt -keypass omitted -storepass omitted
>
> Looking back at this, I don't remember why we used keytool to generate a
> keypair/keystore, then deleted and imported. I think it was because the
> keytool importkeystore command refused to run if the keystore didn't
> already exist.
>
> *ca.crt and pem file*
>
> The ca.crt file contains the root certificate and the intermediate
> certificate that was used to sign the CSR. The pem file contains the signed
> CSR returned to us, the intermediate cert, and the root CA (in that order).
>
> *openssl verify ca.crt and pem*
>
> openssl verify -CAfile ca.crt us-west-2.pem
> signed_certs/us-west-2.pem: OK
>
> *Command output after enabling encryption*
>
> *nodetool status (output truncated)*
>
> Datacenter: us-east
> ===================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address         Load        Tokens  Owns (effective)  Host ID  Rack
> ?N  52.44.11.221    ?           256     25.4%             null     1c
> ...
> ?N  52.204.232.195  ?           256     23.2%             null     1d
> Datacenter: us-west-2
> =====================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address         Load        Tokens  Owns (effective)  Host ID  Rack
> ?N  34.209.2.144    ?           256     26.5%             null     2c
> UN  52.40.32.177    105.99 GiB  256     23.7%             null     2c
> ?N  34.210.109.203  ?           256     24.7%             null     2a
> ...
>
> With the online node being the node with encryption set.
>
> *cqlsh to localhost*
>
> cassy-node6:~$ cqlsh
> Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
>
> *cqlsh to remote node* Remote node is a node with encryption enabled
>
> cassy-node6:~$ cqlsh 10.0.2.7
> Connection error: ('Unable to connect to any servers', {'10.0.2.7': error(111, "Tried connecting to [('10.0.2.7', 9042)]. Last error: Connection refused")})
>
>
