kafka-dev mailing list archives

From Gwen Shapira <g...@confluent.io>
Subject Re: [VOTE] 0.10.0.0 RC4
Date Mon, 16 May 2016 06:12:49 GMT
Thanks, man!

Good to see Heroku being a good friend to the Kafka community by
testing new releases, reporting issues and following up with the cause
and a documentation PR.

With this out of the way, we closed all known blockers for 0.10.0.0.
I'll roll out a new RC tomorrow morning.

Gwen

On Sun, May 15, 2016 at 1:27 PM, Tom Crayford <tcrayford@heroku.com> wrote:
> https://github.com/apache/kafka/pull/1389
>
> On Sun, May 15, 2016 at 9:22 PM, Ismael Juma <ismael@juma.me.uk> wrote:
>
>> Hi Tom,
>>
>> Great to hear that the failure testing scenario went well. :)
>>
>> Your suggested improvement sounds good to me and a PR would be great. For
>> this kind of change, you can skip the JIRA, just prefix the PR title with
>> `MINOR:`.
>>
>> Thanks,
>> Ismael
>>
>> On Sun, May 15, 2016 at 9:17 PM, Tom Crayford <tcrayford@heroku.com>
>> wrote:
>>
>> > How about this?
>> >
>> >     <b>Note:</b> Due to the additional timestamp introduced in each
>> >     message (8 bytes of data), producers sending small messages may see a
>> >     message throughput degradation because of the increased overhead.
>> >     Likewise, replication now transmits an additional 8 bytes per message.
>> >     If you're running close to the network capacity of your cluster, it's
>> >     possible that you'll overwhelm the network cards and see failures and
>> >     performance issues due to the overload. When receiving compressed
>> >     messages, 0.10.0 brokers avoid recompressing the messages, which in
>> >     general reduces the latency and improves the throughput. In certain
>> >     cases, this may reduce the batching size on the producer, which could
>> >     lead to worse throughput. If this happens, users can tune linger.ms
>> >     and batch.size of the producer for better throughput.
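The linger.ms / batch.size tuning mentioned in the note can be exercised with Kafka's bundled perf tool. A minimal sketch; the broker address, topic name, and the specific values below are illustrative assumptions, not recommendations:

```shell
# Hypothetical run: raise batching so small messages amortize the 8-byte
# timestamp overhead (values are illustrative; tune per workload)
bin/kafka-producer-perf-test --topic bench \
  --num-records 1000000 --record-size 100 --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 \
    acks=-1 linger.ms=10 batch.size=65536
```

Larger batch.size and a small non-zero linger.ms let the producer fill bigger batches before sending, trading a little latency for throughput.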
>> >
>> > Would you like a Jira/PR with this kind of change so we can discuss them
>> > in a more convenient format?
>> >
>> > Re our failure testing scenario: Kafka 0.10 RC behaves exactly the same
>> > under failure as 0.9 - the controller typically shifts the leader in
>> > around 2 seconds or so, and the benchmark sees a small drop in throughput
>> > during that, then another drop whilst the replacement broker comes back
>> > to speed. So, overall we're extremely happy and excited for this release!
>> > Thanks to the committers and maintainers for all their hard work.
>> >
>> > On Sun, May 15, 2016 at 9:03 PM, Ismael Juma <ismael@juma.me.uk> wrote:
>> >
>> > > Hi Tom,
>> > >
>> > > Thanks for the update and for all the testing you have done! No worries
>> > > about the chase here, I'd much rather have false positives by people
>> > > who are validating the releases than false negatives because people
>> > > don't validate the releases. :)
>> > >
>> > > The upgrade note we currently have follows:
>> > >
>> > > https://github.com/apache/kafka/blob/0.10.0/docs/upgrade.html#L67
>> > >
>> > > Please feel free to suggest improvements.
>> > >
>> > > Thanks,
>> > > Ismael
>> > >
>> > > On Sun, May 15, 2016 at 6:39 PM, Tom Crayford <tcrayford@heroku.com>
>> > > wrote:
>> > >
>> > > > I've been digging into this some more. It seems like this may have
>> > > > been an issue with benchmarks maxing out the network card - under
>> > > > 0.10.0.0-RC the slight additional bandwidth per message seems to have
>> > > > pushed the broker's NIC into overload territory where it starts
>> > > > dropping packets (verified with ifconfig on each broker). This leads
>> > > > to it not being able to talk to Zookeeper properly, which leads to
>> > > > OfflinePartitions, which then causes issues with the benchmark's
>> > > > validity, as throughput drops a lot when brokers are flapping in and
>> > > > out of being online. 0.9.0.1 doing that 8 bytes less per message
>> > > > means the broker's NIC can sustain more messages/s. There was an
>> > > > "alignment" issue with the benchmarks here - under 0.9 we were *just*
>> > > > at the barrier of the broker's NICs sustaining traffic, and under
>> > > > 0.10 we pushed over that (at 1.5 million messages/s, 8 bytes extra
>> > > > per message is an extra 36 MB/s with replication factor 3 [if my math
>> > > > is right, and that's before SSL encryption which may be additional
>> > > > overhead], which is as much as an additional producer machine).
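The 36 MB/s figure above checks out as back-of-envelope arithmetic (using decimal megabytes):

```shell
# Sanity-check: 1.5M msgs/s * 8 extra bytes/msg * replication factor 3
msgs_per_sec=1500000
extra_bytes_per_msg=8      # the new per-message timestamp
replication_factor=3       # each message is written 3 times cluster-wide
extra=$((msgs_per_sec * extra_bytes_per_msg * replication_factor))
echo "$((extra / 1000000)) MB/s"   # prints "36 MB/s"
```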
>> > > >
>> > > > The dropped packets and the flapping weren't causing notable timeout
>> > > > issues in the producer, but looking at the metrics on the brokers,
>> > > > offline partitions was clearly triggered and ongoing, and the broker
>> > > > logs show ZK session timeouts. This is consistent with earlier
>> > > > benchmarking experience - the number of producers we were running
>> > > > under 0.9.0.1 was carefully selected to be just under the limit here.
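The NIC-level drop check described here can be done on most Linux brokers roughly as follows; the interface name `eth0` and the exact counter field names are assumptions that vary by distro:

```shell
# Look for non-zero dropped/overruns counters on the broker's interface
# ("eth0" is an assumption; field layout varies by system)
ifconfig eth0 | grep -E 'dropped|overruns'

# Equivalent on systems where ifconfig is deprecated:
ip -s link show eth0
```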
>> > > >
>> > > > The other issue with the benchmark where I reported an issue between
>> > > > two single producers was caused by a "performance of producer
>> > > > machine" issue that I wasn't properly aware of. Apologies there.
>> > > >
>> > > > I've done benchmarks now where I limit the producer throughput (via
>> > > > --throughput) to slightly below what the NICs can sustain and seen no
>> > > > notable performance or stability difference between 0.10 and 0.9.0.1
>> > > > as long as you stay under the limits of the network interfaces. All
>> > > > of the clusters I have tested happily keep up a benchmark at this
>> > > > rate for 6 hours under both 0.9.0.1 and 0.10.0.0. I've also verified
>> > > > that our clusters are entirely network bound in these producer
>> > > > benchmarking scenarios - the disks and CPU/memory have a bunch of
>> > > > remaining capacity.
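A throttled run of the kind described would look roughly like this; the 100000 msgs/s cap and the broker list are hypothetical values, not the ones used in the tests:

```shell
# Hypothetical throttled benchmark: --throughput caps the producer below
# the NIC ceiling so broker flapping doesn't pollute the results
bin/kafka-producer-perf-test --topic bench \
  --num-records 500000000 --record-size 100 \
  --throughput 100000 \
  --producer-props acks=-1 bootstrap.servers=BROKER_LIST
```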
>> > > >
>> > > > This was pretty hard to verify fully, which is why I've taken so
>> > > > long to reply. All in all I think the result here is expected and
>> > > > not a blocker for release, but a good thing to note on upgrades - if
>> > > > folk are running at the limit of their network cards (which you never
>> > > > want to do anyway, but benchmarking scenarios often uncover those
>> > > > limits), they'll see issues due to increased replication and producer
>> > > > traffic under 0.10.0.0.
>> > > >
>> > > > Apologies for the chase here - this distinctly seemed like a real
>> > > > issue and one I (and I think everybody else) would have wanted to
>> > > > block the release on. I'm going to move onto our "failure" testing,
>> > > > in which we run the same performance benchmarks whilst causing a hard
>> > > > kill on the node. We've seen very good results for that under 0.9 and
>> > > > hopefully they'll continue under 0.10.
>> > > >
>> > > > On Sat, May 14, 2016 at 1:33 AM, Gwen Shapira <gwen@confluent.io>
>> > > > wrote:
>> > > >
>> > > > > also, perhaps sharing the broker configuration? maybe this will
>> > > > > provide some hints...
>> > > > >
>> > > > > On Fri, May 13, 2016 at 5:31 PM, Ismael Juma <ismael@juma.me.uk>
>> > > > > wrote:
>> > > > > > Thanks Tom. I just wanted to share that I have been unable to
>> > > > > > reproduce this so far. Please feel free to share whatever
>> > > > > > information you have so far when you have a chance, don't feel
>> > > > > > that you need to have all the answers.
>> > > > > >
>> > > > > > Ismael
>> > > > > >
>> > > > > > On Fri, May 13, 2016 at 7:32 PM, Tom Crayford <tcrayford@heroku.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > >> I've been investigating this pretty hard since I first noticed
>> > > > > >> it. Right now I have more avenues for investigation than I can
>> > > > > >> shake a stick at, and am also dealing with several other things
>> > > > > >> in flight/on fire. I'll respond when I have more information and
>> > > > > >> can confirm things.
>> > > > > >>
>> > > > > >> On Fri, May 13, 2016 at 6:30 PM, Becket Qin <becket.qin@gmail.com>
>> > > > > >> wrote:
>> > > > > >>
>> > > > > >> > Tom,
>> > > > > >> >
>> > > > > >> > Maybe it is mentioned and I missed it. I am wondering if you
>> > > > > >> > see performance degradation on the consumer side when TLS is
>> > > > > >> > used? This could help us understand whether the issue is only
>> > > > > >> > producer related or TLS in general.
>> > > > > >> >
>> > > > > >> > Thanks,
>> > > > > >> >
>> > > > > >> > Jiangjie (Becket) Qin
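A consumer-side comparison of the kind Becket suggests could use Kafka's bundled consumer perf tool. A sketch under assumptions: the flags shown and the `ssl-client.properties` file name should be checked against the flags shipped with your version:

```shell
# Hypothetical consumer-side benchmark over TLS, mirroring the producer runs
# (TLS settings are assumed to live in ssl-client.properties)
bin/kafka-consumer-perf-test --broker-list BROKER_LIST --topic bench \
  --messages 50000000 --new-consumer \
  --consumer.config ssl-client.properties
```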
>> > > > > >> >
>> > > > > >> > On Fri, May 13, 2016 at 6:19 AM, Tom Crayford <tcrayford@heroku.com>
>> > > > > >> > wrote:
>> > > > > >> >
>> > > > > >> > > Ismael,
>> > > > > >> > >
>> > > > > >> > > Thanks. I'm writing up an issue with some new findings
>> > > > > >> > > since yesterday right now.
>> > > > > >> > >
>> > > > > >> > > Thanks
>> > > > > >> > >
>> > > > > >> > > Tom
>> > > > > >> > >
>> > > > > >> > > On Fri, May 13, 2016 at 1:06 PM, Ismael Juma <ismael@juma.me.uk>
>> > > > > >> > > wrote:
>> > > > > >> > >
>> > > > > >> > > > Hi Tom,
>> > > > > >> > > >
>> > > > > >> > > > That's because JIRA is in lockdown due to excessive spam.
>> > > > > >> > > > I have added you as a contributor in JIRA and you should
>> > > > > >> > > > be able to file a ticket now.
>> > > > > >> > > >
>> > > > > >> > > > Thanks,
>> > > > > >> > > > Ismael
>> > > > > >> > > >
>> > > > > >> > > > On Fri, May 13, 2016 at 12:17 PM, Tom Crayford <tcrayford@heroku.com>
>> > > > > >> > > > wrote:
>> > > > > >> > > >
>> > > > > >> > > > > Ok, I don't seem to be able to file a new Jira issue at
>> > > > > >> > > > > all. Can somebody check my permissions on Jira? My user
>> > > > > >> > > > > is `tcrayford-heroku`
>> > > > > >> > > > >
>> > > > > >> > > > > Tom Crayford
>> > > > > >> > > > > Heroku Kafka
>> > > > > >> > > > >
>> > > > > >> > > > > On Fri, May 13, 2016 at 12:24 AM, Jun Rao <jun@confluent.io>
>> > > > > >> > > > > wrote:
>> > > > > >> > > > >
>> > > > > >> > > > > > Tom,
>> > > > > >> > > > > >
>> > > > > >> > > > > > We don't have a CSV metrics reporter in the producer
>> > > > > >> > > > > > right now. The metrics will be available in jmx. You
>> > > > > >> > > > > > can find out the details in
>> > > > > >> > > > > > http://kafka.apache.org/documentation.html#new_producer_monitoring
>> > > > > >> > > > > >
>> > > > > >> > > > > > Thanks,
>> > > > > >> > > > > >
>> > > > > >> > > > > > Jun
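One way to pull those producer JMX metrics from the command line is Kafka's JmxTool. A sketch under assumptions: the JMX port 9999 and the wildcard client-id are hypothetical, and the producer JVM must have been started with remote JMX enabled:

```shell
# Hypothetical: attach to a producer JVM started with JMX enabled, e.g.
#   KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote.port=9999 ..."
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name 'kafka.producer:type=producer-metrics,client-id=*' \
  --reporting-interval 5000
```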
>> > > > > >> > > > > >
>> > > > > >> > > > > > On Thu, May 12, 2016 at 3:08 PM, Tom Crayford <tcrayford@heroku.com>
>> > > > > >> > > > > > wrote:
>> > > > > >> > > > > >
>> > > > > >> > > > > > > Yep, I can try those particular commits tomorrow.
>> > > > > >> > > > > > > Before I try a bisect, I'm going to replicate with a
>> > > > > >> > > > > > > less intensive, easier to iterate on, smaller scale
>> > > > > >> > > > > > > perf test.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > Jun, inline:
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > On Thursday, 12 May 2016, Jun Rao <jun@confluent.io>
>> > > > > >> > > > > > > wrote:
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > > Tom,
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > Thanks for reporting this. A few quick comments.
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > 1. Did you send the right command for
>> > > > > >> > > > > > > > producer-perf? The command limits the throughput
>> > > > > >> > > > > > > > to 100 msgs/sec. So, not sure how a single
>> > > > > >> > > > > > > > producer can get 75K msgs/sec.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > Ah yep, wrong commands. I'll get the right one
>> > > > > >> > > > > > > tomorrow. Sorry, was interpolating variables into a
>> > > > > >> > > > > > > shell script.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > 2. Could you collect some stats (e.g. average
>> > > > > >> > > > > > > > batch size) in the producer and see if there is
>> > > > > >> > > > > > > > any noticeable difference between 0.9 and 0.10?
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > That'd just be hooking up the CSV metrics reporter,
>> > > > > >> > > > > > > right?
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > 3. Is the broker-to-broker communication also on
>> > > > > >> > > > > > > > SSL? Could you do another test with replication
>> > > > > >> > > > > > > > factor 1 and see if you still see the degradation?
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > Interbroker replication is always SSL in all test
>> > > > > >> > > > > > > runs so far. I can try with replication factor 1
>> > > > > >> > > > > > > tomorrow.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > Finally, email is probably not the best way to
>> > > > > >> > > > > > > > discuss performance results. If you have more of
>> > > > > >> > > > > > > > them, could you create a jira and attach your
>> > > > > >> > > > > > > > findings there?
>> > > > > >> > > > > > >
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > Yep. I only wrote the email because JIRA was in
>> > > > > >> > > > > > > lockdown mode and I couldn't create new issues.
>> > > > > >> > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > Thanks,
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > Jun
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > On Thu, May 12, 2016 at 1:26 PM, Tom Crayford <tcrayford@heroku.com>
>> > > > > >> > > > > > > > wrote:
>> > > > > >> > > > > > > >
>> > > > > >> > > > > > > > > We've started running our usual suite of
>> > > > > >> > > > > > > > > performance tests against Kafka 0.10.0.0 RC.
>> > > > > >> > > > > > > > > These tests orchestrate multiple
>> > > > > >> > > > > > > > > consumer/producer machines to run a fairly
>> > > > > >> > > > > > > > > normal mixed workload of producers and consumers
>> > > > > >> > > > > > > > > (each producer/consumer are just instances of
>> > > > > >> > > > > > > > > kafka's inbuilt consumer/producer perf tests).
>> > > > > >> > > > > > > > > We've found about a 33% performance drop in the
>> > > > > >> > > > > > > > > producer if TLS is used (compared to 0.9.0.1).
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > We've seen notable producer performance
>> > > > > >> > > > > > > > > degradations between 0.9.0.1 and 0.10.0.0 RC.
>> > > > > >> > > > > > > > > We're running as of the commit 9404680 right
>> > > > > >> > > > > > > > > now.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > Our specific test case runs Kafka on 8 EC2
>> > > > > >> > > > > > > > > machines, with enhanced networking. Nothing is
>> > > > > >> > > > > > > > > changed between the instances, and I've
>> > > > > >> > > > > > > > > reproduced this over 4 different sets of
>> > > > > >> > > > > > > > > clusters now. We're seeing about a 33%
>> > > > > >> > > > > > > > > performance drop between 0.9.0.1 and 0.10.0.0 as
>> > > > > >> > > > > > > > > of commit 9404680. Please note that this doesn't
>> > > > > >> > > > > > > > > match up with
>> > > > > >> > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565,
>> > > > > >> > > > > > > > > because our performance tests are with
>> > > > > >> > > > > > > > > compression off, and this seems to be a TLS-only
>> > > > > >> > > > > > > > > issue.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > Under 0.10.0-rc4, we see an 8 node cluster with
>> > > > > >> > > > > > > > > replication factor of 3, and 13 producers max
>> > > > > >> > > > > > > > > out at around 1 million 100 byte messages a
>> > > > > >> > > > > > > > > second. Under 0.9.0.1, the same cluster does 1.5
>> > > > > >> > > > > > > > > million messages a second. Both tests were with
>> > > > > >> > > > > > > > > TLS on. I've reproduced this on multiple
>> > > > > >> > > > > > > > > clusters now (5 or so of each version) to
>> > > > > >> > > > > > > > > account for the inherent performance variance of
>> > > > > >> > > > > > > > > EC2. There's no notable performance difference
>> > > > > >> > > > > > > > > without TLS on these runs - it appears to be a
>> > > > > >> > > > > > > > > TLS regression entirely.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > A single producer with TLS under 0.10 does about
>> > > > > >> > > > > > > > > 75k messages/s. Under 0.9.0.1 it does around
>> > > > > >> > > > > > > > > 120k messages/s.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > The exact producer-perf line we're using is
>> > > > > >> > > > > > > > > this:
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > bin/kafka-producer-perf-test --topic "bench"
>> > > > > >> > > > > > > > > --num-records "500000000" --record-size "100"
>> > > > > >> > > > > > > > > --throughput "100" --producer-props acks="-1"
>> > > > > >> > > > > > > > > bootstrap.servers=REDACTED
>> > > > > >> > > > > > > > > ssl.keystore.location=client.jks
>> > > > > >> > > > > > > > > ssl.keystore.password=REDACTED
>> > > > > >> > > > > > > > > ssl.truststore.location=server.jks
>> > > > > >> > > > > > > > > ssl.truststore.password=REDACTED
>> > > > > >> > > > > > > > > ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
>> > > > > >> > > > > > > > > security.protocol=SSL
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > We're using the same setup, machine type etc
>> > > > > >> > > > > > > > > for each test run.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > We've tried using both 0.9.0.1 producers and
>> > > > > >> > > > > > > > > 0.10.0.0 producers and the TLS performance
>> > > > > >> > > > > > > > > impact was there for both.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > I've glanced over the code between 0.9.0.1 and
>> > > > > >> > > > > > > > > 0.10.0.0 and haven't seen anything that seemed
>> > > > > >> > > > > > > > > to have this kind of impact - indeed the TLS
>> > > > > >> > > > > > > > > code doesn't seem to have changed much between
>> > > > > >> > > > > > > > > 0.9.0.1 and 0.10.0.0.
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > Any thoughts? Should I file an issue and see
>> > > > > >> > > > > > > > > about reproducing a more minimal test case?
>> > > > > >> > > > > > > > >
>> > > > > >> > > > > > > > > I don't think this is related to
>> > > > > >> > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3565
>> > > > > >> > > > > > > > > - that is for compression on and plaintext, and
>> > > > > >> > > > > > > > > this is for TLS only.
>> > > > > >> > > > > > > > >
