cassandra-user mailing list archives

From Michael Carlise <mcarl...@salesforce.com.INVALID>
Subject Re: unable to gossip with peers exception when internode encryption is set to any setting other than 'none'
Date Wed, 28 Aug 2019 21:32:21 GMT
For clarity, for anybody who comes to this thread in the archive: this
might be an issue with Ec2MultiRegionSnitch altogether; I'm not sure.  But if
I create a local 3-node cluster using ccm (Cassandra 3.11.4), I can drop
the keystore/truststore JKS files in, flip encryption on, and everything
works as expected.  Tomorrow I'll reach out on the Slack channel to see if
anybody can suggest ways to test it, or if anybody is aware of an
ongoing issue.
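
In case it helps anyone comparing keystores between a working cluster and a
failing one, here is a minimal sketch that boils `keytool -list -v` output
down to the lines describing each entry and its certificate chain. The
function name summarize_keystore, and the keystore path and password
variable in the usage comment, are placeholders of mine and not part of
keytool or Cassandra.

```shell
# Filter `keytool -list -v` output (read from stdin) down to the lines that
# identify each entry: alias, entry type, chain length, and the owner/issuer
# of every certificate in the chain.
# summarize_keystore is a hypothetical helper name, not a keytool feature.
summarize_keystore() {
    grep -E '^(Alias name|Entry type|Certificate chain length|Certificate\[[0-9]+\]|Owner|Issuer):'
}

# Usage (path and password variable are placeholders):
#   keytool -list -v -keystore cassy-us-west-2.jks -storepass "$STOREPASS" \
#       | summarize_keystore
```

Running this against the keystore on one node per region makes it easy to
diff the entry types, chain lengths and owner/issuer pairs side by side.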

On Wed, Aug 28, 2019 at 2:49 PM Michael Carlise <mcarlise@salesforce.com>
wrote:

> telnet from node 1 -> node2 7001 (and 7000) works.
>
> However, I can't rule out a JKS keystore/truststore issue.  I have tried a
> number of configurations and none of them have helped (or emitted any
> further error logging).  We have a root and an intermediate CA cert, and a
> private key + signed CSR.  Our keystore has a single PrivateKeyEntry with a
> chain of length 2: the signed CSR and the intermediate cert (in that
> order).  The truststore has a single entry of length one: the
> root cert used to issue the intermediate.  Does anybody know if that is the
> correct setup for JKS?  This setup was given to us by another team in our
> company that uses Java much more than we do.
>
> Some other points to note: CASSANDRA-9386 points out that 'dc'
> internode_encryption doesn't work correctly with Ec2MultiRegionSnitch
> (it always uses encrypted connections).  But I still can't get 'all' to work.
> The way I'm trying to get it to work is by simply flipping encryption
> on for two non-seed nodes in the same datacenter.  In system.log I can see
> them both output the message 'Handshaking with /private IP'.  But then a
> few minutes later the unable-to-gossip exception is thrown.  No other
> information/logs are given, so I assume the handshake failed, presumably
> because of an incorrect truststore/keystore?
>
> I can't seem to find any concrete information about how to set up the
> keystore cert chain and/or the truststore. Does anybody know of any good
> sources on this topic, or know off the top of their head how this setup is
> supposed to look?
>
>
> On Mon, Aug 26, 2019 at 10:01 PM Subroto Barua <sbarua116@yahoo.com.invalid>
> wrote:
>
>> could be an issue with the keystore/truststore --- you may want to run
>> keytool -list to validate the files/password; also run md5sum on the files
>> from 1 node in west and 1 node in east.
>> check ssl port 7001 --- from 1 node in west --> telnet <node in east> 7001
>> (or the custom port if you are not using the default port)
>>
>> On Monday, August 26, 2019, 05:46:19 PM PDT, Michael Carlise
>> <mcarlise@salesforce.com.INVALID> wrote:
>>
>>
>> Subroto -
>>
>> both tools error; openssl reports errno 111 (connection refused), which
>> made me check the bound ports on the c* node with encryption flipped.  Port
>> 9042 is not open (determined by netstat -ant).  I then compared the logs of
>> a node started with and without encryption.  Without encryption, I get a
>> bunch of lines like:
>>
>> OutboundTcpConnection.java:561 - Handshaking version w/ IP
>>
>> And this happens after a line like
>>
>> Gossiper.java - Waiting for gossip to settle...
>>
>> with encryption toggled to 'dc', I don't see any of those lines,
>> presumably because the gossiper is trying to start but doesn't.
>>
>> On Mon, Aug 26, 2019 at 6:51 PM Subroto Barua <sbarua116@yahoo.com.invalid>
>> wrote:
>>
>> Michael,
>>
>> Are you able to connect to any c* node via OpenSSL?
>>
>> openssl s_client -connect <ip address>:9042
>>
>> cqlsh <ip address> --ssl
>>
>> Subroto
>>
>> On Aug 26, 2019, at 2:47 PM, Marc Selwan <marc.selwan@datastax.com>
>> wrote:
>>
>> Which exact version of OpenJDK are you using? Is it possible you don't
>> have JCE on those nodes? (I believe more recent versions of Java 8 have this
>> baked in, so that might not be it.)
>>
>>
>> On Mon, Aug 26, 2019 at 1:56 PM Michael Carlise <
>> mcarlise@salesforce.com.invalid> wrote:
>>
>>
>> I originally opened this issue on Stack Overflow (
>> https://stackoverflow.com/questions/57516660/cassandra-node-to-node-encryption-throws-unable-to-gossip-with-peers-exception
>> ).
>>
>> However, I haven't gotten any responses in over a week.  I'm going to
>> post it here and maybe someone will have an idea on where I can look.
>>
>> We currently run a multi-region Cassandra cluster in AWS. It runs in four
>> regions, 12 nodes per region, with neither node-to-node nor client
>> encryption. We are trying to enable inter-datacenter node-to-node
>> encryption. However, when we flip encryption on, we get an exception
>> that nodes are unable to gossip with any peers.
>>
>> It could be that we didn't build our JKS keystore/truststores
>> correctly (more on how we built these files below). But we also do
>> not see intra-datacenter communication working (which should remain
>> unencrypted). Additionally, cqlsh cannot connect to the node
>> either, even though we have require_client_auth set to false (the
>> default).
>>
>> ERROR [main] 2019-08-15 18:46:32,241 CassandraDaemon.java:749 - Exception encountered during startup
>> java.lang.RuntimeException: Unable to gossip with any peers
>>         at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1435) ~[apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:566) ~[apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:823) ~[apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:683) ~[apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:632) ~[apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:388) [apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:620) [apache-cassandra-3.11.4.jar:3.11.4]
>>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:732) [apache-cassandra-3.11.4.jar:3.11.4]
>> INFO  [main] 2019-08-15 18:47:07,384 YamlConfigurationLoader.java:89 - Configuration location: file:/etc/cassandra/cassandra.yaml
>>
>>
>> Something to note is that this error message occurs after a few minutes
>> of the node being up. (i.e. there is a delay between start up before this
>> exception is thrown).
>>
>> *Information about our cassandra setup*
>>
>> cassandra version: 3.11.4
>> JDK version: openjdk-8.
>> Linux: Ubuntu 18.04 (bionic).
>>
>> *cassandra.yaml*
>>
>> endpoint_snitch: Ec2MultiRegionSnitch
>>
>> server_encryption_options:
>>   internode_encryption: dc
>>   keystore: <omitted>
>>   keystore_password: <omitted>
>>   truststore: <omitted>
>>   truststore_password: <omitted>
>>
>> client_encryption_options:
>>   enabled: false
>>
>> *cassandra-rackdc.properties*
>>
>> prefer_local=true
>>
>> *No obvious errors with SSL output*
>>
>> When starting cassandra with JVM_OPTS="$JVM_OPTS -Djavax.net.debug=ssl" added
>> to cassandra-env.sh we see SSL logs printed to stdout (*Note: Subject
>> and Issuer were omitted on purpose)*.
>>
>> found key for : cassy-us-west-2
>> adding as trusted cert:
>>   Subject: ...
>>   Issuer:  ...
>>   Algorithm: RSA; Serial number: 0xdad28d843fc73325d4c1a75207d4e74
>>   Valid from Fri May 27 00:00:00 UTC 2016 until Tue May 26 23:59:59 UTC 2026
>>
>> ...
>>
>> trigger seeding of SecureRandom
>> done seeding SecureRandom
>>
>> Looking at Java SE SSL/TLS connection debugging
>> <https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html>,
>> this looks correct. Note, however, that we see this series of messages
>> (along with the RSA key signature output) repeated several times in rapid
>> succession. We never observe any messages about the trust store being
>> added; however, that might be something that occurs only on client
>> initiation (?)
>>
>> Additionally, we do see cassandra report that the Encrypted Messaging
>> service has been started.
>>
>> INFO  [main] 2019-08-15 18:45:31,022 MessagingService.java:704 - Starting Encrypted Messaging Service on SSL port 7001
>>
>> *Doesn't appear to be a cassandra.yaml configuration problem*
>>
>> We can bring the node back online by simply configuring internode_encryption:
>> none. This action seems to rule out a broadcast_address or rpc_address
>> configuration problem.
>>
>> *How we built our keystore/truststores*
>>
>> We followed the basic template datastax docs for preparing SSL
>> certificates
>> <https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/configuration/secureSSLCertWithCA.html>.
>> One minor difference was that our private key and CSRs were generated using
>> openssl, one per region (we plan to share key/signed certs across
>> nodes in a region). They were created using a command template like:
>>
>> openssl req -new -newkey rsa:2048 -out cassy-<region>.csr -keyout cassy-<region>.key -config cassy-<region>.conf -subj "..." -nodes -sha256
>>
>> The generated CSR was then signed by an internal root CA. Because we
>> generated our files using openssl, we had to build our jks files by
>> importing our certs into them.
>>
>> *Commands to generate truststore*
>>
>> We distribute this one file to all nodes.
>>
>> keytool -importcert \
>>     -keystore generic-server-truststore.jks \
>>     -alias rootCa \
>>     -file rootCa.crt \
>>     -noprompt \
>>     -keypass omitted \
>>     -storepass omitted
>>
>> *Commands to generate keystore*
>>
>> This was done one per region; but essentially we created a keystore with
>> keytool, then deleted the key entry and then imported our key entry using
>> keytool from a pkcs12 file.
>>
>> keytool -genkeypair -keyalg RSA -alias cassy-${region} -keystore cassy-${region}.jks -storepass omitted -keypass omitted -validity 365 -keysize 2048 -dname "..."
>>
>> keytool -delete -alias cassy-${region} -keystore cassy-${region}.jks -storepass omitted
>>
>> openssl pkcs12 -export -in signed_certs/${region}.pem -inkey keys/cassandra.${region}.key -name cassy-${region} -out ${region}.p12
>>
>> keytool -importkeystore -deststorepass omitted -destkeystore cassy-${region}.jks -srckeystore ${region}.p12 -srcstoretype PKCS12
>>
>> keytool -importcert -keystore cassy-${region}.jks -alias rootCa -file ca.crt -noprompt -keypass omitted -storepass omitted
>>
>> Looking back at this, I don't remember why we used keytool to generate a
>> keypair/keystore, then deleted and imported. I think it was because the
>> keytool importkeystore command refused to run if the keystore didn't
>> already exist.
>>
>> *ca.crt and pem file*
>>
>> The ca.crt file contains the root certificate and the intermediate
>> certificate that was used to sign the CSR. The pem file contains the signed
>> CSR returned to us, the intermediate cert, and the root CA (in that order).
>>
>> *openssl verify ca.crt and pem*
>>
>> openssl verify -CAfile ca.crt us-west-2.pem
>> signed_certs/us-west-2.pem: OK
>>
>> *Command output after enabling encryption*
>>
>> *nodetool status (output truncated)*
>>
>> Datacenter: us-east
>> ===================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address         Load        Tokens  Owns (effective)  Host ID  Rack
>> ?N  52.44.11.221    ?           256     25.4%             null     1c
>> ...
>> ?N  52.204.232.195  ?           256     23.2%             null     1d
>> Datacenter: us-west-2
>> =====================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address         Load        Tokens  Owns (effective)  Host ID  Rack
>> ?N  34.209.2.144    ?           256     26.5%             null     2c
>> UN  52.40.32.177    105.99 GiB  256     23.7%             null     2c
>> ?N  34.210.109.203  ?           256     24.7%             null     2a
>> ...
>>
>> With the online node being the node with encryption set.
>>
>> *cqlsh to localhost*
>>
>> cassy-node6:~$ cqlsh
>> Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
>>
>> *cqlsh to remote node* Remote node is a node with encryption enabled
>>
>> cassy-node6:~$ cqlsh 10.0.2.7
>> Connection error: ('Unable to connect to any servers', {'10.0.2.7': error(111, "Tried connecting to [('10.0.2.7', 9042)]. Last error: Connection refused")})
>>
>>
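
As far as I can tell, 9042 is refused here because the node never finishes
starting: the exception is thrown during startup, before the native
transport comes up. A handshake attempt against the internode SSL port is
therefore probably the more informative test. Below is a minimal sketch,
assuming openssl is installed on the node and require_client_auth is left
at false under server_encryption_options. check_internode_tls is a
hypothetical helper name, and 7001 is the SSL port from the
MessagingService log line above.

```shell
# Attempt a TLS handshake against a node's encrypted internode port and
# print the certificate chain the node presents.
# check_internode_tls is a hypothetical helper; host and port are placeholders.
check_internode_tls() {
    host=$1
    port=${2:-7001}   # 7001 is the SSL storage port seen in the log above
    # </dev/null closes stdin so s_client exits once the handshake is done.
    openssl s_client -connect "${host}:${port}" -showcerts </dev/null
}

# Usage (the address is a placeholder):
#   check_internode_tls 10.0.2.7
```

If the handshake completes, the output lists the chain the node serves,
which can be compared with what the peer's truststore is expected to
accept. If it fails, openssl usually reports the alert or verification
error, which is more than system.log showed here.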
