kafka-dev mailing list archives

From "Eno Thereska (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-4474) Poor kafka-streams throughput
Date Thu, 01 Dec 2016 20:17:58 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15712966#comment-15712966 ]

Eno Thereska commented on KAFKA-4474:
-------------------------------------

[~jjchorrobe] Thanks for reporting. A couple of questions:
- How long does the application run, and are there enough records? I ask because if
the run is very short (a few seconds), we won't be in steady state, and the cost of things
like partition rebalancing might dominate. Ideally the application (single instance or
multiple instances) should run for 60 seconds or so.
- If the CPU is completely pegged at 100%, even small effects like having 2
processes rather than 2 threads might lead to some thrashing, severely degrading
performance. Do you observe CPU utilization close to 100%?
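
To illustrate the steady-state point, here is a minimal sketch of how a throughput figure could be computed while excluding the start-up phase. The function name steady_state_rate, the 10-second warm-up window, and the sample numbers are all hypothetical and are not taken from the reporter's setup; the samples are assumed to be (seconds since start, cumulative message count) pairs collected by whatever tool measures the 'output' topic.

```python
from typing import List, Tuple


def steady_state_rate(samples: List[Tuple[float, int]], warmup_s: float = 10.0) -> float:
    """Return messages/second measured only after the warm-up window.

    samples: (seconds_since_start, cumulative_message_count) pairs in time order.
    warmup_s: assumed warm-up length; the right value depends on the setup.
    """
    # Discard everything recorded before the warm-up window has elapsed.
    kept = [(t, n) for t, n in samples if t >= warmup_s]
    if len(kept) < 2:
        raise ValueError("need at least two samples after warm-up")
    (t0, n0), (t1, n1) = kept[0], kept[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (n1 - n0) / (t1 - t0)


# Hypothetical run: slow during the first 10 s (e.g. rebalancing), then steady.
samples = [(0.0, 0), (5.0, 1_000), (10.0, 5_000), (40.0, 3_605_000), (70.0, 7_205_000)]

rate_after_warmup = steady_state_rate(samples)
naive_average = samples[-1][1] / samples[-1][0]  # whole run, warm-up included
print(rate_after_warmup, naive_average)
```

With these made-up samples the whole-run average comes out lower than the post-warm-up rate, which is the effect a very short run would exaggerate.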

I'll try to answer your questions once I understand the above a bit better. One problem with
running everything locally is that lots of different things end up mixed together. For example,
partitions are used as a unit of storage parallelism, but in this case all 4 partitions are
on the same local disk. In an ideal experiment, the 4 partitions would be on 4 different disks.
Also, the fact that ZooKeeper and the Kafka broker are on the same machine (that's my understanding
of your setup, correct me if I'm wrong) further perturbs the measurements, since they consume
quite a bit of CPU as well, potentially adding to the thrashing. Is there a way you can put the
Kafka cluster on a separate machine? If not, we'll work with what you have, but it is not
an ideal setup.

> Poor kafka-streams throughput
> -----------------------------
>
>                 Key: KAFKA-4474
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4474
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.1.0
>            Reporter: Juan Chorro
>            Assignee: Eno Thereska
>
> Hi!
> I'm writing because I'm concerned about kafka-streams throughput.
> I have a single kafka-streams application instance that consumes from the 'input' topic, prints
> to the screen, and produces to the 'output' topic. All topics have 4 partitions. As you can see,
> the topology is very simple.
> I produce 120K messages/second to the 'input' topic, but when I measure the 'output' topic I
> see only ~4K messages/second. I used the following configuration (all remaining parameters
> left at their defaults):
> application.id: myApp
> bootstrap.servers: localhost:9092
> zookeeper.connect: localhost:2181
> num.stream.threads: 1
> I ran various tests without success, but when I created a new 'input' topic
> with 1 partition (keeping the 'output' topic at 4 partitions) I got 120K
> messages/second on the 'output' topic.
> I then ran performance tests for the following cases (all topics have 4
> partitions in all cases):
> Case A - 1 instance:
> - With num.stream.threads set to 1 I got ~3785 messages/second
> - With num.stream.threads set to 2 I got ~3938 messages/second
> - With num.stream.threads set to 4 I got ~120K messages/second
> Case B - 2 instances:
> - With num.stream.threads set to 1 I got ~3930 messages/second for each instance (total
> throughput ~8K messages/second)
> - With num.stream.threads set to 2 I got ~3945 messages/second for each instance (total
> throughput roughly the same as with num.stream.threads set to 1)
> Case C - 4 instances:
> - With num.stream.threads set to 1 I got 3946 messages/second for each instance (total
> throughput ~17K messages/second)
> As can be seen, I get the best results when num.stream.threads equals the number of partitions.
> So I have the following questions:
> - Why do I always get ~4K messages/second when the topic has more than 1 partition and
> num.stream.threads is set to 1?
> - In case C, 4 instances with num.stream.threads set to 1 should perform better than 1 instance
> with num.stream.threads set to 4. Is this assumption correct?
> This is the kafka-streams application that I use: https://gist.github.com/Chorro/5522ec4acd1a005eb8c9663da86f5a18
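
For reference, the cases in the description can be mapped onto tasks per stream thread. The sketch below assumes one task per 'input' partition and an even round-robin spread of tasks over all threads of all instances; the real Kafka Streams assignor is more involved, and the function name tasks_per_thread is hypothetical. It only shows how much work each thread would own in each case and does not by itself explain the measured rates.

```python
from typing import List


def tasks_per_thread(partitions: int, instances: int, threads_per_instance: int) -> List[int]:
    """Spread one task per partition round-robin over all stream threads.

    Assumption: a simplified even distribution, not the actual assignor.
    """
    total_threads = instances * threads_per_instance
    counts = [0] * total_threads
    for task in range(partitions):
        counts[task % total_threads] += 1
    return counts


# Cases from the issue description; every topic has 4 partitions.
cases = {
    "A, 1 instance, 1 thread": (4, 1, 1),
    "A, 1 instance, 2 threads": (4, 1, 2),
    "A, 1 instance, 4 threads": (4, 1, 4),
    "B, 2 instances, 1 thread each": (4, 2, 1),
    "B, 2 instances, 2 threads each": (4, 2, 2),
    "C, 4 instances, 1 thread each": (4, 4, 1),
}

for name, (partitions, instances, threads) in cases.items():
    print(name, tasks_per_thread(partitions, instances, threads))
```

Under this assumption, 1 instance with 4 threads, 2 instances with 2 threads each, and 4 instances with 1 thread each all end up with one task per thread, so any difference in measured throughput between them would have to come from something other than the task count per thread.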



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
