cassandra-user mailing list archives

From Michael Carlise <mcarl...@salesforce.com.INVALID>
Subject Re: unable to gossip with peers exception when internode encryption is set to any setting other than 'none'
Date Tue, 27 Aug 2019 00:45:33 GMT
Subroto -

Both tools error out; openssl reports errno 111 (connection refused), which made me
check the bound ports on the C* node with encryption turned on. Port 9042 is not
open (checked with netstat -ant). I then compared the logs for a node started
with and without encryption. Without encryption, I get a bunch of lines like:

OutboundTcpConnection.java:561 - Handshaking version w/ IP

And this happens after a line like

Gossiper.java - Waiting for gossip to settle...

With encryption set to 'dc', I don't see any of those lines, presumably
because the gossiper tries to start but never does.
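A minimal Java sketch of the same port check that netstat and errno 111 point at: it attempts a plain TCP connect and reports whether anything is listening. The port list assumes the cassandra.yaml defaults (native_transport_port 9042, storage_port 7000, ssl_storage_port 7001); adjust it if your cluster overrides them. `PortProbe` and `isListening` are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: a TCP reachability probe, similar in spirit to
// checking `netstat -ant` on the node or seeing ECONNREFUSED (errno 111).
public class PortProbe {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    // A refused connection (nothing listening) or a timeout returns false.
    static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        // Assumed defaults: 9042 = native transport (cqlsh),
        // 7000 = storage_port, 7001 = ssl_storage_port.
        int[] ports = {9042, 7000, 7001};
        for (int port : ports) {
            System.out.printf("%s:%d listening=%b%n", host, port, isListening(host, port, 2000));
        }
    }
}
```

Run it on the node itself and from a peer in the same region; a port that is open locally but refused remotely would point at the network rather than at Cassandra.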

On Mon, Aug 26, 2019 at 6:51 PM Subroto Barua <sbarua116@yahoo.com.invalid>
wrote:

> Michael,
>
> Are you able to connect to any c* node via OpenSSL?
>
> openssl s_client -connect <ip address>:9042
>
> cqlsh <ip address> --ssl
>
> Subroto
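The same TLS check can be sketched in Java with JSSE's `SSLSocket`, which does roughly what `openssl s_client` does. One caveat: with `client_encryption_options.enabled: false`, port 9042 is not expected to speak TLS at all, so probing the internode SSL port (7001, per the "Starting Encrypted Messaging Service" log line in this thread) is probably more informative. `TlsProbe` is a hypothetical helper; it uses the JVM's default truststore unless you pass `-Djavax.net.ssl.trustStore=...` pointing at the cluster truststore.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper: attempt a TLS handshake, roughly like `openssl s_client`.
public class TlsProbe {

    // Returns the negotiated protocol (e.g. "TLSv1.2"), or null if the TCP
    // connect or the TLS handshake fails. Uses the JVM default trust settings.
    static String handshake(String host, int port, int timeoutMs) {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // bound the wait for the server's reply
            socket.startHandshake();
            return socket.getSession().getProtocol();
        } catch (IOException e) {
            System.err.printf("%s:%d handshake failed: %s%n", host, port, e);
            return null;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 7001; // assumed ssl_storage_port
        System.out.println("negotiated: " + handshake(host, port, 3000));
    }
}
```

A connection refused error here means nothing is listening; a handshake exception (for example a certificate path error) means the listener is up but the certificates do not validate.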
>
> On Aug 26, 2019, at 2:47 PM, Marc Selwan <marc.selwan@datastax.com> wrote:
>
> Which exact version of OpenJDK are you using? Is it possible you don't
> have JCE on those nodes? (I believe more recent versions of Java 8 have this
> baked in, so that might not be it.)
>
>
> Marc Selwan | DataStax | PM, Server Team | (925) 413-7079 |
> Twitter <https://twitter.com/MarcSelwan>
>
>
>
>
> On Mon, Aug 26, 2019 at 1:56 PM Michael Carlise <
> mcarlise@salesforce.com.invalid> wrote:
>
>>
>> I originally opened this issue on stackoverflow (
>> https://stackoverflow.com/questions/57516660/cassandra-node-to-node-encryption-throws-unable-to-gossip-with-peers-exception
>> ).
>>
>> However, it hasn't gotten any responses in over a week, so I'm posting it
>> here in case someone has an idea of where I can look.
>>
>> We currently run a multi-region Cassandra cluster in AWS: four regions,
>> 12 nodes per region. It runs without node-to-node encryption (and without
>> client encryption). We are trying to enable inter-datacenter node-to-node
>> encryption. However, when we turn encryption on, we get an exception saying
>> that nodes are unable to gossip with any peers.
>>
>> It's possible that we didn't build our JKS keystores/truststores
>> correctly (more on how we built these files below). However, we also do
>> not see intra-datacenter communication working, even though it should remain
>> unencrypted. In addition, cqlsh cannot connect to the node, even though
>> require_client_auth is false (the default).
>>
>> ERROR [main] 2019-08-15 18:46:32,241 CassandraDaemon.java:749 - Exception encountered during startup
>> java.lang.RuntimeException: Unable to gossip with any peers
>>         at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1435) ~[apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:566) ~[apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:823) ~[apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:683) ~[apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:632) ~[apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:388) [apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:620) [apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:732) [apache-cassandra-3.11.4.jar:3.11.4]
>> INFO  [main] 2019-08-15 18:47:07,384 YamlConfigurationLoader.java:89 - Configuration location: file:/etc/cassandra/cassandra.yaml
>>
>>
>> Note that this error occurs only after the node has been up for a few
>> minutes, i.e. there is a delay between startup and this exception.
>>
>> *Information about our cassandra setup*
>>
>> cassandra version: 3.11.4
>> JDK version: openjdk-8.
>> Linux: Ubuntu 18.04 (bionic).
>>
>> *cassandra.yaml*
>>
>> endpoint_snitch: Ec2MultiRegionSnitch
>>
>> server_encryption_options:
>>   internode_encryption: dc
>>   keystore: <omitted>
>>   keystore_password: <omitted>
>>   truststore: <omitted>
>>   truststore_password: <omitted>
>>
>> client_encryption_options:
>>   enabled: false
>>
>> *cassandra-rackdc.properties*
>>
>> prefer_local=true
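To make the intended behaviour of `internode_encryption: dc` concrete, here is a hypothetical model (not Cassandra's source code) of the documented rule: connections to peers in another datacenter use TLS on ssl_storage_port, while connections within the same datacenter stay on the plain storage_port; `rack` additionally encrypts traffic between racks. Under that reading, intra-DC gossip should not depend on the keystores at all. The DC and rack names below are taken from the nodetool output in this thread; `InternodeEncryption` and `encrypts` are illustrative names.

```java
// Hypothetical model of the documented internode_encryption semantics.
// It only decides whether a connection to a peer would be encrypted.
public class InternodeEncryption {

    enum Mode { NONE, ALL, DC, RACK }

    static boolean encrypts(Mode mode, String localDc, String localRack,
                            String peerDc, String peerRack) {
        switch (mode) {
            case NONE:
                return false;
            case ALL:
                return true;
            case DC:
                // Only traffic that crosses a datacenter boundary is encrypted.
                return !localDc.equals(peerDc);
            case RACK:
                // Plain only when the peer is in the same DC *and* the same rack.
                return !(localDc.equals(peerDc) && localRack.equals(peerRack));
            default:
                throw new IllegalArgumentException("unknown mode " + mode);
        }
    }

    public static void main(String[] args) {
        System.out.println(encrypts(Mode.DC, "us-west-2", "2c", "us-west-2", "2a")); // false: same DC
        System.out.println(encrypts(Mode.DC, "us-west-2", "2c", "us-east", "1c"));   // true: different DC
    }
}
```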
>>
>> *No obvious errors in SSL debug output*
>>
>> When starting Cassandra with JVM_OPTS="$JVM_OPTS -Djavax.net.debug=ssl" added
>> to cassandra-env.sh, we see SSL logs printed to stdout (*note: Subject
>> and Issuer were omitted on purpose*).
>>
>> found key for : cassy-us-west-2
>> adding as trusted cert:
>>   Subject: ...
>>   Issuer:  ...
>>   Algorithm: RSA; Serial number: 0xdad28d843fc73325d4c1a75207d4e74
>>   Valid from Fri May 27 00:00:00 UTC 2016 until Tue May 26 23:59:59 UTC 2026
>>
>> ...
>>
>> trigger seeding of SecureRandom
>> done seeding SecureRandom
>>
>> Based on the Java SE SSL/TLS connection debugging guide
>> <https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html>,
>> this looks correct. Note, however, that this series of messages (along with
>> the RSA key signature output) is repeated several times in rapid succession.
>> We never see any messages about the truststore being added, though that
>> might only happen when a client initiates a connection (?).
>>
>> We do see Cassandra report that the encrypted messaging service has
>> started:
>>
>> INFO  [main] 2019-08-15 18:45:31,022 MessagingService.java:704 - Starting Encrypted Messaging Service on SSL port 7001
>>
>> *Doesn't appear to be a cassandra.yaml configuration problem*
>>
>> We can bring the node back online simply by setting internode_encryption:
>> none. This seems to rule out a broadcast_address or rpc_address
>> configuration problem.
>>
>> *How we built our keystore/truststores*
>>
>> We used the DataStax docs for preparing SSL certificates
>> <https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/configuration/secureSSLCertWithCA.html>
>> as a basic template. One minor difference is that we generated our private
>> keys and CSRs with openssl, one per region (we plan to share the key and
>> signed cert across all nodes in a region), using this command template:
>>
>> openssl req -new -newkey rsa:2048 -out cassy-<region>.csr -keyout cassy-<region>.key -config cassy-<region>.conf -subj "..." -nodes -sha256
>>
>> The generated CSR was then signed by an internal root CA. Because we
>> generated our files using openssl, we had to build our jks files by
>> importing our certs into them.
>>
>> *Commands to generate truststore*
>>
>> We distribute this one file to all nodes.
>>
>> keytool -importcert \
>>     -keystore generic-server-truststore.jks \
>>     -alias rootCa \
>>     -file rootCa.crt \
>>     -noprompt \
>>     -keypass omitted \
>>     -storepass omitted
>>
>> *Commands to generate keystore*
>>
>> We did this once per region: we created a keystore with keytool, deleted
>> the generated key entry, and then imported our own key entry from a PKCS12
>> file with keytool.
>>
>> keytool -genkeypair -keyalg RSA -alias cassy-${region} -keystore cassy-${region}.jks -storepass omitted -keypass omitted -validity 365 -keysize 2048 -dname "..."
>>
>> keytool -delete -alias cassy-${region} -keystore cassy-${region}.jks -storepass omitted
>>
>> openssl pkcs12 -export -in signed_certs/${region}.pem -inkey keys/cassandra.${region}.key -name cassy-${region} -out ${region}.p12
>>
>> keytool -importkeystore -deststorepass omitted -destkeystore cassy-${region}.jks -srckeystore ${region}.p12 -srcstoretype PKCS12
>>
>> keytool -importcert -keystore cassy-${region}.jks -alias rootCa -file ca.crt -noprompt -keypass omitted -storepass omitted
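One way to double-check the result of these commands is to list what actually ended up in the JKS file (`keytool -list -v -keystore cassy-<region>.jks` shows the same information). The Java sketch below prints each alias with its entry type and, for the private key entry, the length of the stored certificate chain. Since the truststore only contains the root CA, peers are likely to need the intermediate certificate from this chain, so a chain length of 1 would be worth investigating. `StoreInspector` is a hypothetical name; the store path and password are passed as arguments.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: summarize the entries of a JKS keystore or truststore.
public class StoreInspector {

    // One line per entry: alias, entry type, and for private key entries the
    // number of certificates stored in the chain alongside the key.
    static List<String> describe(KeyStore ks) throws Exception {
        List<String> lines = new ArrayList<>();
        for (String alias : Collections.list(ks.aliases())) {
            if (ks.isKeyEntry(alias)) {
                Certificate[] chain = ks.getCertificateChain(alias);
                int len = chain == null ? 0 : chain.length;
                lines.add(alias + ": key entry, chain length " + len);
            } else if (ks.isCertificateEntry(alias)) {
                X509Certificate cert = (X509Certificate) ks.getCertificate(alias);
                lines.add(alias + ": trusted cert, subject " + cert.getSubjectX500Principal());
            }
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // args[0] = path to the .jks file, args[1] = store password
        KeyStore ks = KeyStore.getInstance("JKS");
        try (InputStream in = new FileInputStream(args[0])) {
            ks.load(in, args[1].toCharArray());
        }
        describe(ks).forEach(System.out::println);
    }
}
```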
>>
>> Looking back at this, I don't remember why we used keytool to generate a
>> keypair/keystore, then deleted and imported. I think it was because the
>> keytool importkeystore command refused to run if the keystore didn't
>> already exist.
>>
>> *ca.crt and pem file*
>>
>> The ca.crt file contains the root certificate and the intermediate
>> certificate that was used to sign the CSR. The pem file contains the signed
>> CSR returned to us, the intermediate cert, and the root CA (in that order).
>>
>> *openssl verify ca.crt and pem*
>>
>> openssl verify -CAfile ca.crt us-west-2.pem
>> signed_certs/us-west-2.pem: OK
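`openssl verify` confirms that the signed certificate chains up to the CA file; a complementary check is that the PEM bundle itself is in leaf, intermediate, root order, as described above. The hypothetical `ChainOrder` sketch below parses every certificate in the bundle with `CertificateFactory` and reports each position where a certificate's issuer is not the subject of the next certificate.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;
import java.util.ArrayList;
import java.util.List;
import javax.security.auth.x500.X500Principal;

// Hypothetical helper: check that a PEM bundle is ordered leaf -> ... -> root.
public class ChainOrder {

    // subjects/issuers are given in file order; report every position where
    // certificate i is not issued by certificate i + 1.
    static List<String> orderProblems(List<X500Principal> subjects, List<X500Principal> issuers) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i + 1 < subjects.size(); i++) {
            if (!issuers.get(i).equals(subjects.get(i + 1))) {
                problems.add("cert " + i + " is issued by " + issuers.get(i)
                        + " but cert " + (i + 1) + " is " + subjects.get(i + 1));
            }
        }
        return problems;
    }

    public static void main(String[] args) throws Exception {
        // args[0] = PEM bundle, e.g. signed_certs/<region>.pem
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        List<X500Principal> subjects = new ArrayList<>();
        List<X500Principal> issuers = new ArrayList<>();
        try (InputStream in = new FileInputStream(args[0])) {
            for (Certificate c : cf.generateCertificates(in)) {
                X509Certificate x = (X509Certificate) c;
                subjects.add(x.getSubjectX500Principal());
                issuers.add(x.getIssuerX500Principal());
            }
        }
        List<String> problems = orderProblems(subjects, issuers);
        System.out.println(problems.isEmpty() ? "chain order OK" : String.join("\n", problems));
    }
}
```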
>>
>> *Command output after enabling encryption*
>>
>> *nodetool status (output truncated)*
>>
>> Datacenter: us-east
>> ===================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address         Load        Tokens  Owns (effective)  Host ID  Rack
>> ?N  52.44.11.221    ?           256     25.4%             null     1c
>> ...
>> ?N  52.204.232.195  ?           256     23.2%             null     1d
>> Datacenter: us-west-2
>> =====================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address         Load        Tokens  Owns (effective)  Host ID  Rack
>> ?N  34.209.2.144    ?           256     26.5%             null     2c
>> UN  52.40.32.177    105.99 GiB  256     23.7%             null     2c
>> ?N  34.210.109.203  ?           256     24.7%             null     2a
>> ...
>>
>> The one node shown as up (UN) is the node with encryption enabled.
>>
>> *cqlsh to localhost*
>>
>> cassy-node6:~$ cqlsh
>> Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
>>
>> *cqlsh to remote node* (the remote node has encryption enabled)
>>
>> cassy-node6:~$ cqlsh 10.0.2.7
>> Connection error: ('Unable to connect to any servers', {'10.0.2.7': error(111, "Tried connecting to [('10.0.2.7', 9042)]. Last error: Connection refused")})
>>
>>
