kafka-jira mailing list archives

From "Onur Karaman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-5857) Excessive heap usage on controller node during reassignment
Date Fri, 08 Sep 2017 17:01:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158933#comment-16158933 ]

Onur Karaman commented on KAFKA-5857:

I wouldn't be surprised if there were no attempts so far at making the controller memory-efficient.

There's a slight chance I may have coincidentally run into the same issue yesterday while
preparing for an upcoming talk. I tried timing how long it takes to complete a reassignment
with many empty partitions and noticed that progress eventually halted and the controller
hit an OOM.

Here's my setup on my laptop:
> rm -rf /tmp/zookeeper/ /tmp/kafka-logs* logs*
> ./gradlew clean jar
> ./bin/zookeeper-server-start.sh config/zookeeper.properties
> export LOG_DIR=logs0 && ./bin/kafka-server-start.sh config/server0.properties
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic t --partitions 5000 --replication-factor 1
> export LOG_DIR=logs1 && ./bin/kafka-server-start.sh config/server1.properties
> python
import json
reassignment = {"version": 1, "partitions": [{"topic": "t", "partition": p, "replicas": [0, 1]} for p in range(5000)]}
with open("reassignment.txt", "w") as f:
  json.dump(reassignment, f, separators=(',', ':'))
> ./zkCli.sh -server localhost:2181
> create /admin/reassign_partitions <json here>

Note that I had to use the zkCli.sh that ships with ZooKeeper just to write the reassignment
into zk. Kafka's kafka-reassign-partitions.sh gets stuck before writing to ZooKeeper, and zookeeper-shell.sh
seems to hang while copying the reassignment into the command.
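Since the stock tooling hung, one workaround is to write the znode programmatically. Here is a minimal sketch assuming the third-party kazoo ZooKeeper client; the helper names are mine, not Kafka's:

```python
import json

def build_reassignment(topic, num_partitions, replicas):
    """Build the compact JSON payload for /admin/reassign_partitions."""
    plan = {"version": 1,
            "partitions": [{"topic": topic, "partition": p, "replicas": replicas}
                           for p in range(num_partitions)]}
    return json.dumps(plan, separators=(",", ":")).encode("utf-8")

def submit_via_zk(payload, hosts="localhost:2181"):
    """Write the payload straight into ZooKeeper (needs a running ensemble)."""
    from kazoo.client import KazooClient  # third-party client, assumed installed
    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        zk.create("/admin/reassign_partitions", payload)
    finally:
        zk.stop()

payload = build_reassignment("t", 5000, [0, 1])
# For 5000 single-replica-list entries the payload is only a couple of
# hundred KB, well under ZooKeeper's default 1 MB jute.maxbuffer limit,
# so the znode write itself is not the bottleneck.
print(len(payload))
```

This sidesteps both the reassignment tool and the shell quoting entirely; the expensive part stays on the controller side, not in the znode write.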

Below are my broker configs:
> cat config/server0.properties
> cat config/server1.properties

I haven't looked into the cause of the OOM. I ran the scenario again just now and found that
the controller spent a significant amount of time in G1 Old Gen GC.

> Excessive heap usage on controller node during reassignment
> -----------------------------------------------------------
>                 Key: KAFKA-5857
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5857
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions:
>         Environment: CentOs 7, Java 1.8
>            Reporter: Raoufeh Hashemian
>              Labels: reliability
>             Fix For: 1.1.0
>         Attachments: CPU.png, disk_write_x.png, memory.png, reassignment_plan.txt
> I was trying to expand our kafka cluster of 6 broker nodes to 12 broker nodes.
> Before expansion, we had a single topic with 960 partitions and a replication factor
of 3, so each node had 480 partitions. The size of data on each node was 3 TB.
> To do the expansion, I submitted a partition reassignment plan (see attached file for
the current/new assignments). The plan was optimized to minimize data movement and be rack-aware.
> When I submitted the plan, it took approximately 3 hours for the data movement from old
to new nodes to complete. After that, it started deleting source partitions (I say this based
on the number of file descriptors) and rebalancing leaders, which did not succeed.
Meanwhile, heap usage on the controller node started climbing steeply (along with long GC
times); after 5 hours the controller ran out of memory, and another controller started to
show the same behaviour for another 4 hours. At that point ZooKeeper ran out of disk and
the service stopped.
> To recover from this condition:
> 1) Removed zk logs to free up disk and restarted all 3 zk nodes
> 2) Deleted /kafka/admin/reassign_partitions node from zk
> 3) Had to do unclean restarts of the Kafka service on the OOM controller nodes, which took
3 hours to complete. After this stage there were still 676 under-replicated partitions.
> 4) Did a clean restart on all 12 broker nodes.
> After step 4, the number of under-replicated partitions went to 0.
> So I was wondering whether this memory footprint on the controller is expected for ~1k
partitions? Did we do something wrong, or is it a bug?
> Attached are some resource usage graphs from this 30-hour event, along with the reassignment
plan. I'll try to add log files as well.

This message was sent by Atlassian JIRA
