flink-user mailing list archives

From Bruno Aranda <brunoara...@gmail.com>
Subject Re: One Kafka consumer operator subtask sending many more bytes than the others
Date Fri, 17 Feb 2017 09:56:50 GMT
Cool, thanks for your feedback Till. I will investigate the logs and our
Kafka installation. So far we use Flink 1.2 with Kafka on AWS. The Kafka
client is 0.10 (with the underlying Java client). I will have a look at the
logs and try different things with Kafka today (i.e. upgrade it, configure
the availability zones, etc.) to see if I can pinpoint the issue.



On Fri, 17 Feb 2017 at 09:44 Till Rohrmann <trohrmann@apache.org> wrote:

> Hi Bruno,
> do you resume from the savepoints with a changed parallelism? Which
> version of Flink are you running? Which Kafka version are you using (and
> which Kafka consumer version)?
> The partitions should be distributed among all 6 parallel subtasks which
> means that every subtask should read from 100 partitions. You can check in
> the logs how many partitions each consumer has been assigned by searching
> for "Got x partitions from these topics" or "Consumer is going to read the
> following topics". If you don't see 100 partitions assigned to each
> consumer, then something is clearly going wrong.
> Cheers,
> Till
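Till's expectation of an even 100-partitions-per-subtask split can be sketched with a simple round-robin model (a simplified illustration, not Flink's exact assignment code; the function name `assign_partitions` is hypothetical):

```python
def assign_partitions(num_partitions: int, parallelism: int) -> dict:
    # Round-robin sketch: partition p is read by subtask p % parallelism.
    # (Simplified model of how a Kafka source spreads partitions over
    # parallel consumer subtasks.)
    assignment = {i: [] for i in range(parallelism)}
    for p in range(num_partitions):
        assignment[p % parallelism].append(p)
    return assignment

counts = {i: len(ps) for i, ps in assign_partitions(600, 6).items()}
# 600 partitions over 6 subtasks: each subtask reads 100 partitions
```

Under this model no subtask can end up with a disproportionate share of partitions, which is why an uneven assignment in the logs would indicate a real problem.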
> On Thu, Feb 16, 2017 at 1:42 PM, Bruno Aranda <baranda@apache.org> wrote:
> Hi,
> I am trying to understand this issue in one of my Flink jobs with a Kafka
> source. At the moment I am using parallelism 6. The operator subtasks just
> read the Kafka records (from a topic with 600 partitions), apply a keyBy
> and send them to the next operator after the hashing.
> What I can see is that one of the subtasks is sending much more data than
> the others. The Kafka partitions are well balanced, in that they roughly
> contain the same number of records.
> I attach a screenshot. One subtask has sent 429 MB and the others around
> 4.88 KB.
> Could this be related to the fact that I originally created this stream
> with parallelism one and increased it later at some point? This is from a
> performance test suite where I increase the parallelism between runs to
> check the performance, in a highly available cluster setup with ZooKeeper
> and checkpoints in S3.
> I am not cancelling the job with savepoints.
> Any ideas or things I can look for?
> Thanks!
> Bruno
> [image: Screen Shot 2017-02-16 at 12.36.29.png]
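One way a single subtask can send far more bytes than its peers, even when the Kafka partitions themselves are balanced, is key skew after the keyBy: all records with the same key are routed to the same downstream subtask. A minimal sketch of that effect (a toy model with hypothetical keys; Flink actually murmur-hashes keys into key groups, CRC32 is used here only for a deterministic illustration):

```python
import zlib
from collections import Counter

def key_to_subtask(key: str, parallelism: int) -> int:
    # Simplified model of keyBy routing: every record with the same
    # key lands on the same downstream subtask.
    return zlib.crc32(key.encode()) % parallelism

# Hypothetical skewed input: almost all records share one key.
records = ["user-1"] * 1000 + ["user-%d" % i for i in range(5)]
load = Counter(key_to_subtask(k, 6) for k in records)
# one subtask receives over 1000 records; the rest get a handful
```

If the job's key distribution looks like this, the 429 MB vs 4.88 KB imbalance in the screenshot would be expected behaviour of the hashing rather than a partition-assignment bug.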
