helix-commits mailing list archives

From "Joy (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HELIX-535) Helix controller stops working with heavy configuration
Date Tue, 28 Oct 2014 23:19:33 GMT
Joy created HELIX-535:

             Summary: Helix controller stops working with heavy configuration
                 Key: HELIX-535
                 URL: https://issues.apache.org/jira/browse/HELIX-535
             Project: Apache Helix
          Issue Type: Bug
          Components: helix-core
         Environment: machine:$ uname -a
Linux eat1-app373.stg 2.6.32-220.10.1.el6.x86_64 #1 SMP Fri Mar 9 12:37:51 EST 2012 x86_64
x86_64 x86_64 GNU/Linux

JVM version: $ /export/apps/jdk/current/bin/java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)

            Reporter: Joy

The issue occurs consistently under heavy configurations: a larger number of znodes, partitions,
and databases.

The goal of our tests is to evaluate the performance of the Helix controller (in terms of
controller latency) as the number of nodes, databases, and partitions increases.

In our test, we use multiple machines: one for ZooKeeper, one for the Helix controller, and the
rest for dummy processes. The setup is as follows:
        zkr <----------> helix
      dummy processes

We intentionally kill the master dummy processes once every 30 seconds to simulate a failure
event. Everything works fine with a light configuration such as 27 nodes + 1 database + 729
partitions. However, with a heavy configuration such as 81 nodes + 10 databases + 81 partitions
per database, the controller latency increases significantly after several failure events:
Event      Controller latency (ms)
First      182
Second     188
Third      200
Fourth     193
Fifth      200
Sixth      185
Seventh    189
Eighth     213
Ninth      1082209
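
For anyone trying to reproduce this, a small script can flag the point at which the latency
departs from its baseline. The sketch below is not part of the test harness used above; the
function name find_latency_spikes and the 10x-median threshold are assumptions chosen only for
illustration. It compares each sample against the median of all earlier samples.

```python
from statistics import median

# Controller latencies (ms) per failure event, copied from the table above.
latencies_ms = [182, 188, 200, 193, 200, 185, 189, 213, 1082209]


def find_latency_spikes(samples, factor=10):
    """Return (event_number, latency) pairs whose latency exceeds
    `factor` times the median of all earlier samples.

    `factor=10` is an arbitrary illustrative threshold, not a Helix setting.
    """
    spikes = []
    for i in range(1, len(samples)):
        baseline = median(samples[:i])  # baseline from earlier events only
        if samples[i] > factor * baseline:
            spikes.append((i + 1, samples[i]))  # event numbers are 1-based
    return spikes


spikes = find_latency_spikes(latencies_ms)
print(spikes)
```

With the numbers above, only the ninth event is flagged; the first eight stay within a narrow
band of roughly 180 to 215 ms.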

After this extremely long delay (about 18 minutes), the Helix controller stops working. The
controller log is attached.
