helix-commits mailing list archives

From "Joy (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HELIX-535) Helix controller stops working with heavy configuration
Date Tue, 28 Oct 2014 23:25:35 GMT

     [ https://issues.apache.org/jira/browse/HELIX-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joy updated HELIX-535:
----------------------
    Attachment: xaf
                xae
                xad
                xac
                xab
                xaa

The controller log is split into 6 files to work around the size limit.

> Helix controller stops working with heavy configuration
> -------------------------------------------------------
>
>                 Key: HELIX-535
>                 URL: https://issues.apache.org/jira/browse/HELIX-535
>             Project: Apache Helix
>          Issue Type: Bug
>          Components: helix-core
>         Environment: machine:$ uname -a
> Linux eat1-app373.stg 2.6.32-220.10.1.el6.x86_64 #1 SMP Fri Mar 9 12:37:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
> JVM version: $ /export/apps/jdk/current/bin/java -version
> java version "1.6.0_21"
> Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
>            Reporter: Joy
>         Attachments: xaa, xab, xac, xad, xae, xaf
>
>
> The issue consistently comes up with a heavy configuration: a higher number of znodes, a higher number of partitions, and a higher number of databases.
> The goal of our tests is to evaluate the performance of the Helix controller (in terms of controller latency) with an increased number of nodes, databases, and partitions.
> In our test, we use multiple machines: one for ZooKeeper, one for the Helix controller, and the rest for dummy processes. The configuration is as below:
>         zkr <----------> helix
>          ^
>          |
>         V
>       dummy processes
> We intentionally kill the master dummy processes once every 30 seconds to simulate failure events (a sketch of such a failure-injection loop follows the quoted report). Everything works fine with a light configuration, such as 27 nodes + 1 db + 729 partitions. However, when the configuration is heavy, such as 81 nodes + 10 databases + 81 partitions per db, the controller latency increases significantly after several failure events:
>                  Controller Latency (ms)
> First event:     182
> Second event:    188
> Third event:     200
> Fourth event:    193
> Fifth event:     200
> Sixth event:     185
> Seventh event:   189
> Eighth event:    213
> Ninth event:     1082209
> After this extremely long failure event, the Helix controller stops working. The controller log is attached.
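The report above does not include the tooling used to kill a master dummy process every 30 seconds, so the following is only a minimal Java sketch of such a failure-injection loop. The class name, the PID list, and the victim selection are hypothetical; a real harness would first look up which participant currently holds a MASTER replica (for example, from the cluster's external view) before choosing the process to kill.

import java.util.Arrays;
import java.util.List;
import java.util.Random;

/**
 * Failure-injection sketch: every 30 seconds, kill one of the dummy
 * participant processes to simulate the failure events described above.
 */
public class FailureInjector {
    public static void main(String[] args) throws Exception {
        // Hypothetical PIDs of the dummy participant processes; a real run
        // would pick the process currently holding a MASTER replica.
        List<Long> dummyPids = Arrays.asList(12345L, 12346L, 12347L);
        Random random = new Random();

        while (true) {
            Thread.sleep(30000);  // 30-second interval between failure events
            long victim = dummyPids.get(random.nextInt(dummyPids.size()));
            // SIGKILL the chosen process to emulate an abrupt node failure.
            new ProcessBuilder("kill", "-9", Long.toString(victim)).start().waitFor();
            System.out.println("Killed dummy process " + victim);
        }
    }
}

Controller latency for each event would then be measured externally, e.g. from the timestamps in the attached controller log.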



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
