kafka-dev mailing list archives

From Michael Carter <michael.car...@instaclustr.com>
Subject Re: Dynamic configuration of interbroker SSL certificates
Date Tue, 13 Aug 2019 01:12:34 GMT
Hi Colin,

Thanks for the explanation. Yes, I can see the sense in that, although it was certainly a
surprise for me. I’ll just have to be a bit more careful about instantiating AdminClients
in the future.

Cheers,
Michael

> On 13 Aug 2019, at 1:36 am, Colin McCabe <cmccabe@apache.org> wrote:
> 
> Hi Michael,
> 
> The NetworkClient periodically fetches metadata so that it always knows what the cluster
> topology is. This also helps it to have some open connections when needed to reduce the
> latency of operations.
> 
> To be fair, we haven’t thought very much about optimizing this since the overhead seems
> to be low. The update interval can also be configured.
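> 
> For the AdminClient, that interval is controlled by metadata.max.age.ms. A minimal sketch
> (the bootstrap address here is just a placeholder):
> 
>     // org.apache.kafka.clients.admin.AdminClient / AdminClientConfig, java.util.Properties
>     Properties props = new Properties();
>     props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9094");
>     // refresh cluster metadata every 10 minutes instead of the default (5 minutes)
>     props.put(AdminClientConfig.METADATA_MAX_AGE_CONFIG, "600000");
>     AdminClient admin = AdminClient.create(props);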
> 
> Best,
> Colin
> 
> On Sun, Aug 11, 2019, at 16:43, Michael Carter wrote:
>> Hi Rajini,
>> 
>> I just thought I'd let you know I've sorted through my issue now. It turns out that
>> I was doing something silly. The SSL handshake errors weren't caused by the brokers
>> talking to themselves but were instead being initiated from an AdminClient I had running
>> in a separate process. Ironically, it was the same AdminClient I was using to rotate the
>> certificates in the first place. I had naively assumed that because I wasn't asking the
>> AdminClient to do anything at that exact moment, it wouldn't be contacting the brokers.
>> Instead, it seems the AdminClient has an internal thread running that periodically tries
>> to get metadata information from the brokers.
>> Although now I'm wondering: why is this continual requesting necessary? Is there some
>> reason that work isn't deferred until the AdminClient is specifically asked to do something?
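>> 
>> For now I'll just make sure the client only lives for as long as I actually need it,
>> something like this (rough sketch):
>> 
>>     try (AdminClient admin = AdminClient.create(props)) { // props: the usual bootstrap + SSL client settings
>>         // ... perform the keystore rotation ...
>>     } // close() shuts down the client's background network thread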
>> 
>> Cheers,
>> Michael
>> 
>>> On 5 Aug 2019, at 3:33 pm, Michael Carter <michael.carter@instaclustr.com> wrote:
>>> 
>>> Hi Rajini,
>>> 
>>> Thanks for the good suggestions. I’ve investigated further and confirmed that the issue
>>> continues to occur even with hostname verification turned off.
>>> I put in the SSL debug option and got some interesting information. I’ve attached an
>>> example of what the logs output when SSL errors start occurring.
>>> 
>>> From the debug output I can see that the internal IP addresses do form part of the
>>> alternative names on the certificate, which is good.
>>> What it shows is that the verification is failing because it is using an out-of-date
>>> certificate (one whose validity period has expired).
>>> 
>>> It is interesting that in the attached example, the broker appears to be trying to
>>> connect to itself, and one end of the connection has the new, valid certificate, while
>>> the other end has the old, out-of-date certificate, even though my understanding is that
>>> it should be one and the same certificate.
>>> Does this suggest that there is some network component inside Kafka that hasn’t responded
>>> to the reconfiguration of the interbroker keystore property?
>>> 
>>> Cheers,
>>> Michael
>>> <debug-handshake.log>
>>> 
>>>> On 2 Aug 2019, at 7:04 pm, Rajini Sivaram <rajinisivaram@gmail.com> wrote:
>>>> 
>>>> Hi Michael,
>>>> 
>>>> Thanks for the logs.
>>>> 
>>>> When keystore is reconfigured in a stable cluster, we wouldn't expect to
>>>> see any failures even if there is an issue with the certs since existing
>>>> connections continue to work without re-authenticating. When a broker is
>>>> restarted, there could be controller re-election, leader elections etc.
>>>> causing new connections on all brokers, so failures can crop up on all
>>>> brokers.
>>>> 
>>>> When we reconfigure inter-broker keystore, we validate the keystore against
>>>> the truststore on each broker. Since we are seeing handshake failures
>>>> within a broker (e.g. controller connections), it suggests that the
>>>> validation performed by the actual connection is failing while our
>>>> validation during reconfiguration succeeded. The main difference between
>>>> the two is hostname verification - we disable hostname verification when
>>>> validating configs, but for actual connections this seems to be enabled in
>>>> your configs. Can you verify that your certs contain the host IP address
>>>> used in the advertised listeners list (e.g. 10.224.59.10)? In order to rule
>>>> out hostname issues, you could rerun your tests with
>>>> `listener.name.interbroker.ssl.endpoint.identification.algorithm=`,
>>>> i.e. empty string to disable hostname verification.
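>>>> 
>>>> One way to double-check the SAN entries and validity dates in the keystore is
>>>> something along these lines (rough sketch - the keystore type, path and password
>>>> are placeholders for your own values):
>>>> 
>>>>     import java.io.FileInputStream;
>>>>     import java.security.KeyStore;
>>>>     import java.security.cert.X509Certificate;
>>>>     import java.util.Collections;
>>>> 
>>>>     public class DumpKeystore {
>>>>         public static void main(String[] args) throws Exception {
>>>>             KeyStore ks = KeyStore.getInstance("JKS"); // or "PKCS12"
>>>>             try (FileInputStream in = new FileInputStream("/path/to/interbroker.keystore.jks")) {
>>>>                 ks.load(in, "keystore-password".toCharArray());
>>>>             }
>>>>             for (String alias : Collections.list(ks.aliases())) {
>>>>                 X509Certificate cert = (X509Certificate) ks.getCertificate(alias);
>>>>                 if (cert == null) continue; // skip entries without a certificate
>>>>                 System.out.println(alias + ": valid " + cert.getNotBefore() + " to " + cert.getNotAfter());
>>>>                 System.out.println("  SANs: " + cert.getSubjectAlternativeNames());
>>>>             }
>>>>         }
>>>>     }
>>>> 
>>>> The IP address from the advertised listeners should show up among the SANs (as an
>>>> iPAddress entry), and the validity window should cover the current time.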
>>>> 
>>>> If the issue still occurs with hostname verification disabled, it will be
>>>> useful to run the brokers with
>>>> KAFKA_OPTS="-Djavax.net.debug=ssl,handshake". That would show the certs
>>>> used as well as more details about the handshake failure.
>>>> 
>>>> Regards,
>>>> 
>>>> Rajini
>>>> 
>>>> 
>>>> 
>>>> On Fri, Aug 2, 2019 at 8:49 AM Michael Carter <
>>>> michael.carter@instaclustr.com> wrote:
>>>> 
>>>>> Hi Rajini,
>>>>> 
>>>>> Thanks for your response. I've attached some logs from a three node
>>>>> cluster that I was doing this on. For the purpose of this test I'm using
>>>>> certificates with ten minutes' validity. The original certificate was valid
>>>>> from roughly 05:32 to 05:42. I rotated the certificates every five minutes, and stopped
>>>>> the third broker at 05:48 (five or so minutes after the original
>>>>> certificates would have expired). At this point the errors start to appear
>>>>> in the logs in volume. A lot of them are errors connecting to the restarted
>>>>> broker, but not all of them. Some of them are the first broker (which
>>>>> hasn't been restarted) apparently having trouble connecting to itself. I
>>>>> restarted the other brokers at about 05:49 but the errors continued.
>>>>> 
>>>>> To answer your questions:
>>>>> 1) All brokers are dynamically updated with new certs and this works fine
>>>>> without any errors (this just indicates that config update succeeded)
>>>>> 
>>>>> - Yes that is correct.
>>>>> 
>>>>> 2) If one broker is restarted, handshake failures appear in the logs and
>>>>> these are logged from SocketServer in the restarted broker. (this could
>>>>> indicate that some client-side cert was not updated dynamically)
>>>>> 
>>>>> - The errors tend to appear in all the brokers' logs. I can also get the
>>>>> same thing just by shutting down one of the brokers. It doesn't actually
>>>>> need to be restarted, since the SslAuthenticationExceptions will appear in
>>>>> the brokers that weren't shut down or restarted regardless. In fact, I can
>>>>> get a message saying "Failed authentication with <IP address> (SSL
>>>>> handshake failed)" when no broker is running on that IP address at all.
>>>>> Some of the messages indicate a failure of a broker connecting to itself
>>>>> (e.g. "Failed authentication with /10.224.59.10 (SSL handshake failed)"
>>>>> in the logs of the broker running on 10.224.59.10)
>>>>> 
>>>>> 3) Once all brokers are restarted, there are no more handshake failures
>>>>> (this is the actual confirmation that certs from 1) work)
>>>>> 
>>>>> - Unfortunately no, the errors persist even after this. Sometimes, it will
>>>>> recover from this situation if I run another dynamic certificate rotation.
>>>>> But restarting the brokers doesn't seem to clear it. It might be tempting
>>>>> to think that there is something wrong with the new certificates, but
>>>>> they're generated in the exact same way as the original certificates
>>>>> (except for the validity time period), and they seem to be served
>>>>> correctly after the dynamic rotation. (I'm checking with this command:
>>>>> openssl s_client -debug -connect 10.224.59.10:9094 2>/dev/null </dev/null
>>>>> | openssl x509 -noout -dates ). Also, subsequent updates with similar
>>>>> certificates can sometimes fix the problem.
>>>>> 
>>>>> The fact that it can persist after a restart is what really sticks out to
>>>>> me. Is there any state stored outside of Kafka that could be getting in the
>>>>> way? Does ZooKeeper store anything other than the filename of the keystore?
>>>>> 
>>>>> Cheers,
>>>>> Michael
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 1 Aug 2019, at 7:10 pm, Rajini Sivaram <rajinisivaram@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Michael,
>>>>>> 
>>>>>> Thank you for reporting this. It is possible that we missed out some
>>>>>> client-side update in KAFKA-8336. Would it be possible for you to provide
>>>>>> logs from multiple brokers around the time of a broker restart when
>>>>>> handshake failures started occurring, so that we can try to work out which
>>>>>> client connection is failing? Corresponding to the handshake failure entry
>>>>>> from SocketServer in the restarted broker, there should be a failed client
>>>>>> connection on another broker. If that has a stack trace, it would be
>>>>>> useful.
>>>>>> 
>>>>>> Also just to confirm, are these correct:
>>>>>> 
>>>>>> 1) All brokers are dynamically updated with new certs and this works fine
>>>>>> without any errors (this just indicates that config update succeeded)
>>>>>> 2) If one broker is restarted, handshake failures appear in the logs and
>>>>>> these are logged from SocketServer in the restarted broker. (this could
>>>>>> indicate that some client-side cert was not updated dynamically)
>>>>>> 3) Once all brokers are restarted, there are no more handshake failures
>>>>>> (this is the actual confirmation that certs from 1) work)
>>>>>> 
>>>>>> Thank you,
>>>>>> 
>>>>>> Rajini
>>>>>> 
>>>>>> On Thu, Aug 1, 2019 at 12:40 AM Michael Carter <
>>>>>> michael.carter@instaclustr.com> wrote:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I'm having an issue with dynamic configuration of interbroker SSL
>>>>>>> certificates (in Kafka 2.3.0) that I'm hoping someone can give me insight
>>>>>>> on. I've previously posted something similar on the Users email list, but I
>>>>>>> think it will need some help from developers experienced with how the
>>>>>>> networking code in Kafka works.
>>>>>>> 
>>>>>>> I'm trying to use SSL two-way authentication for inter-broker
>>>>>>> communication, with short-lived SSL certificates, rotating them
>>>>>>> frequently without needing to do a broker restart. So, on each broker in my
>>>>>>> cluster, I periodically generate a new certificate keystore file and set
>>>>>>> the "listener.name.interbroker.ssl.keystore.location" broker config
>>>>>>> property. (I'm using inter.broker.listener.name=INTERBROKER)
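>>>>>>> 
>>>>>>> (For reference, I'm pushing the new path from an external AdminClient, roughly
>>>>>>> along these lines - the broker id, bootstrap address and keystore path are just
>>>>>>> examples, and error handling and password configs are left out:)
>>>>>>> 
>>>>>>>     // imports: org.apache.kafka.clients.admin.*, org.apache.kafka.common.config.ConfigResource, java.util.*
>>>>>>>     // (inside a method that throws Exception)
>>>>>>>     Properties props = new Properties();
>>>>>>>     props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9094");
>>>>>>>     // ... plus the SSL client settings needed to reach the INTERBROKER listener
>>>>>>> 
>>>>>>>     try (AdminClient admin = AdminClient.create(props)) {
>>>>>>>         ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
>>>>>>>         AlterConfigOp setKeystore = new AlterConfigOp(
>>>>>>>                 new ConfigEntry("listener.name.interbroker.ssl.keystore.location",
>>>>>>>                                 "/etc/kafka/ssl/interbroker-new.jks"),
>>>>>>>                 AlterConfigOp.OpType.SET);
>>>>>>>         Map<ConfigResource, Collection<AlterConfigOp>> update =
>>>>>>>                 Collections.singletonMap(broker, Collections.singletonList(setKeystore));
>>>>>>>         admin.incrementalAlterConfigs(update).all().get();
>>>>>>>     }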
>>>>>>> 
>>>>>>> Setting this property works fine, and everything appears OK. And manually
>>>>>>> connecting to the inter-broker listener shows it's correctly serving the
>>>>>>> new certificate. But if I ever restart a broker after the original
>>>>>>> certificate has expired (the one the broker started up with, which is no
>>>>>>> longer configured anywhere), then communication failures between brokers
>>>>>>> start to occur. My logs fill up with messages like this:
>>>>>>> 
>>>>>>> [2019-07-22 03:57:43,605] INFO [SocketServer brokerId=1] Failed
>>>>>>> authentication with 10.224.70.3 (SSL handshake failed)
>>>>>>> (org.apache.kafka.common.network.Selector)
>>>>>>> 
>>>>>>> A little bit of extra logging injected into the code tells me that the
>>>>>>> failures are caused by the out of date SSL certificates being used. So it
>>>>>>> seems there are some network components inside Kafka still stuck on the old
>>>>>>> settings.
>>>>>>> This sounds similar to the behaviour described in KAFKA-8336
>>>>>>> (https://issues.apache.org/jira/browse/KAFKA-8336), but this is marked as
>>>>>>> fixed in 2.3.0.
>>>>>>> 
>>>>>>> I've confirmed that all the SslChannelBuilders and SslFactories appear to
>>>>>>> be reconfigured correctly when the dynamic setting is set. I've tried
>>>>>>> closing all existing KafkaChannels on a reconfiguration event, in order to
>>>>>>> force them to re-open with the new certificates, but the problem persists.
>>>>>>> 
>>>>>>> Does anyone have any idea what other components may be hanging around,
>>>>>>> using the old certificates?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Michael
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>> 

