From: "Joy (JIRA)"
To: commits@helix.incubator.apache.org
Reply-To: dev@helix.apache.org
Date: Tue, 28 Oct 2014 23:19:33 +0000 (UTC)
Subject: [jira] [Created] (HELIX-535) Helix controller stops working with heavy configuration

Joy created HELIX-535:
-------------------------

             Summary: Helix controller stops working with heavy configuration
                 Key: HELIX-535
                 URL: https://issues.apache.org/jira/browse/HELIX-535
             Project: Apache Helix
          Issue Type: Bug
          Components: helix-core
         Environment: machine:
$ uname -a
Linux eat1-app373.stg 2.6.32-220.10.1.el6.x86_64 #1 SMP Fri Mar 9 12:37:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

JVM version:
$ /export/apps/jdk/current/bin/java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
            Reporter: Joy


The issue comes up consistently with a heavy configuration: a larger number of znodes, partitions, and databases.

The goal of our tests is to evaluate the performance of the Helix controller (in terms of controller latency) as the number of nodes, databases, and partitions increases.

In our test, we use multiple machines: one for ZooKeeper, one for the Helix controller, and the rest for dummy processes. The configuration is as below:

zkr <----------> helix
 ^
 |
 V
dummy processes

We intentionally kill the master dummy processes once every 30 seconds to simulate a failure event. Everything works fine with a light configuration such as 27 nodes + 1 database + 729 partitions. However, when the configuration is heavy, such as 81 nodes + 10 databases + 81 partitions for each database, the controller latency increases significantly after several failure events:

Control Latency (ms)
First event: 182
Second event: 188
Third event: 200
Fourth event: 193
Fifth event: 200
Sixth event: 185
Seventh event: 189
Eighth event: 213
Ninth event: 1082209

After this extremely long failure event, the Helix controller stops working. The controller log is attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
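
To put the reported numbers in perspective, here is a minimal sketch that summarizes them. The latency values and configuration sizes are copied from the report. Treating the first eight events as the baseline is an assumption made for illustration only, and the helper name `total_partitions` is hypothetical.

```python
from statistics import mean

# Controller latencies in ms for failure events 1..9, copied from the report.
latencies_ms = [182, 188, 200, 193, 200, 185, 189, 213, 1082209]


def total_partitions(databases, partitions_per_db):
    """Hypothetical helper: total partitions across all databases."""
    return databases * partitions_per_db


# Assumption: the first eight events represent normal controller behavior.
baseline_ms = mean(latencies_ms[:8])
outlier_ms = latencies_ms[8]

slowdown = outlier_ms / baseline_ms      # how many times slower than baseline
outlier_minutes = outlier_ms / 1000 / 60  # ninth event expressed in minutes

light = total_partitions(databases=1, partitions_per_db=729)
heavy = total_partitions(databases=10, partitions_per_db=81)

print(f"baseline latency: {baseline_ms:.2f} ms")
print(f"ninth event: {outlier_minutes:.1f} min ({slowdown:.0f}x baseline)")
print(f"total partitions: light={light}, heavy={heavy}")
```

By this arithmetic the heavy configuration has only slightly more partitions in total than the light one (810 versus 729), while the node count triples from 27 to 81 and the database count goes from 1 to 10.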