Mailing-List: contact commits-help@helix.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@helix.apache.org
Date: Tue, 28 Oct 2014 23:19:33 +0000 (UTC)
From: "Joy (JIRA)" <jira@apache.org>
To: commits@helix.incubator.apache.org
Message-ID: <JIRA.12751267.1414538336000.358992.1414538373965@Atlassian.JIRA>
In-Reply-To: <JIRA.12751267.1414538336000@Atlassian.JIRA>
References: <JIRA.12751267.1414538336000@Atlassian.JIRA>
 <JIRA.12751267.1414538336220@arcas>
Subject: [jira] [Created] (HELIX-535) Helix controller stops working with
 heavy configuration
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Joy created HELIX-535:
-------------------------

             Summary: Helix controller stops working with heavy configuration
                 Key: HELIX-535
                 URL: https://issues.apache.org/jira/browse/HELIX-535
             Project: Apache Helix
          Issue Type: Bug
          Components: helix-core
         Environment: machine:$ uname -a
Linux eat1-app373.stg 2.6.32-220.10.1.el6.x86_64 #1 SMP Fri Mar 9 12:37:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

JVM version: $ /export/apps/jdk/current/bin/java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)

            Reporter: Joy


The issue occurs consistently under a heavy configuration: a larger number of znodes, partitions, and databases.

The goal of our tests is to evaluate the performance of the Helix controller (in terms of controller latency) as the number of nodes, databases, and partitions increases.

In our test, we use multiple machines: one for ZooKeeper, one for the Helix controller, and the rest for dummy processes. The setup is as follows:

    zookeeper <----------> helix controller
        ^
        |
        v
    dummy processes

We intentionally kill the master dummy processes once every 30 seconds to simulate a failure event. Everything works fine with a light configuration such as 27 nodes + 1 database + 729 partitions. However, with a heavy configuration such as 81 nodes + 10 databases + 81 partitions per database, the controller latency increases dramatically after several failure events:

    Failure event    Controller latency (ms)
    First            182
    Second           188
    Third            200
    Fourth           193
    Fifth            200
    Sixth            185
    Seventh          189
    Eighth           213
    Ninth            1082209

After this extremely long failure event, the Helix controller stops working. The controller log is attached.
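
To put the ninth measurement in perspective, here is a minimal sketch (not part of the test harness) that scans the latencies listed above and reports the first event that is far above the median of the events before it. The function name `first_spike` and the 10x threshold are assumptions chosen for illustration only.

```python
from statistics import median

# Controller latencies (ms) per failure event, copied from the report above.
latencies_ms = [182, 188, 200, 193, 200, 185, 189, 213, 1082209]

# Assumed threshold: an event counts as a spike when its latency exceeds
# 10x the median of all earlier events. The factor is illustrative only.
SPIKE_FACTOR = 10


def first_spike(samples, factor=SPIKE_FACTOR):
    """Return (1-based event number, latency, baseline) for the first spike, or None."""
    for i in range(1, len(samples)):
        baseline = median(samples[:i])
        if samples[i] > factor * baseline:
            return i + 1, samples[i], baseline
    return None


spike = first_spike(latencies_ms)
if spike is not None:
    event, latency, baseline = spike
    print(f"event {event}: {latency} ms = {latency / 60000:.1f} min, "
          f"{latency / baseline:.0f}x the prior median of {baseline} ms")
```

With the numbers above, this flags the ninth event: about 18 minutes, several thousand times the median of the first eight events (191 ms).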


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)