Return-Path: X-Original-To: apmail-ambari-dev-archive@www.apache.org Delivered-To: apmail-ambari-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id CC33918361 for ; Wed, 12 Aug 2015 22:34:37 +0000 (UTC) Received: (qmail 22460 invoked by uid 500); 12 Aug 2015 22:34:37 -0000 Delivered-To: apmail-ambari-dev-archive@ambari.apache.org Received: (qmail 22420 invoked by uid 500); 12 Aug 2015 22:34:37 -0000 Mailing-List: contact dev-help@ambari.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@ambari.apache.org Delivered-To: mailing list dev@ambari.apache.org Received: (qmail 22395 invoked by uid 99); 12 Aug 2015 22:34:37 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 Aug 2015 22:34:37 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id 03366DC582; Wed, 12 Aug 2015 22:34:35 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============3812314514285272520==" MIME-Version: 1.0 Subject: Re: Review Request 37356: Fix synchronization on TopologyManager's outstanding requests From: "John Speidel" To: "John Speidel" , "Mahadev Konar" , "Robert Levas" , "Sumit Mohanty" , "Jonathan Hurley" Cc: "Robert Nettleton" , "Ambari" Date: Wed, 12 Aug 2015 22:34:35 -0000 Message-ID: <20150812223435.1374.55110@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org/ Auto-Submitted: auto-generated Sender: "John Speidel" X-ReviewGroup: Ambari X-Auto-Response-Suppress: DR, RN, OOF, AutoReply X-ReviewRequest-URL: https://reviews.apache.org/r/37356/ X-Sender: "John Speidel" References: <20150812215455.1374.59933@reviews.apache.org> In-Reply-To: <20150812215455.1374.59933@reviews.apache.org> Reply-To: "John Speidel" X-ReviewRequest-Repository: ambari --===============3812314514285272520== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/37356/#review95190 ----------------------------------------------------------- Ship it! Ship It! - John Speidel On Aug. 12, 2015, 9:54 p.m., Robert Nettleton wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/37356/ > ----------------------------------------------------------- > > (Updated Aug. 12, 2015, 9:54 p.m.) > > > Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, Robert Levas, and Sumit Mohanty. > > > Bugs: AMBARI-12720 > https://issues.apache.org/jira/browse/AMBARI-12720 > > > Repository: ambari > > > Description > ------- > > This patch provides a fix for AMBARI-12720. > > Blueprint Cluster deployments are failing under load, with clusters of 50+ nodes. One or more hosts in the cluster fail to be registered, even though the ambari-agent registration occurs properly for these hosts. This problem seems to occur when certain hosts are registered at roughly the same time as the call to TopologyManager.processRequest(). > > Upon further analysis, it appears that the TopologyManager.outstandingRequests object is not fully sychronized. > > There are two paths involved in reading/writing to this data structure: > > 1. TopologyManager.processRequest() (occurs on the main request thread handling the cluster creation request). When all known hosts have been processed, but some requests are pending, the LogicalRequest instance is added to TopologyManager.outstandingRequests. Note: The call to add this request is not synchronized on the outstandingRequests object, but is only synchronized on TopologyManager.availableHosts. > 2. TopologyManager.onHostRegistered() (occurs on a separate thread that handles ambari-agent registrations for new hosts joining the cluster). This method does synchronize on TopologyManager.outstandingRequests, and attempts to iterate over these requests to determine if the newly registered host can satisfy the request. > > The incorrect synchronization on TopologyManager.outstandingRequests appears to be the root of the intermittent cluster deployment problems described in AMBARI-12720. Since this datastructure is read/updated by both threads, and since this object is not protected by the same locks, there appears to be a race condition in this code. > > To fix this problem, this patch implements the following: > > 1. Updates the TopologyManager.onHostRegistered() method to lock on > --> availableHosts > --> outstandingRequests > prior to reading TopologyManager.outstandingRequests > > 2. Updates the TopologyManager.processRequest() method to lock on > --> availableHosts > --> outstandingRequests > prior to adding a LogicalRequest instance to this collection > > > The locking order (availableHosts, outstandingRequests) must be kept in place in order to avoid any potential deadlocks from making a change like this. > > > Diffs > ----- > > ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java 4f51d47 > > Diff: https://reviews.apache.org/r/37356/diff/ > > > Testing > ------- > > 1. Ran the ambari-server unit test suite ("mvn clean test"). All tests are passing. > 2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this change does not break existing deployments. > 3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make sure this change does not break deployments of larger clusters. > > > Thanks, > > Robert Nettleton > > --===============3812314514285272520==--