Subject: Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.
From: keith@deenlo.com
To: "Alex Moundalexis"
Cc: "accumulo", "Sean Busbey", keith@deenlo.com
Date: Wed, 20 Nov 2013 17:56:12 -0000
Message-ID: <20131120175612.6018.53707@reviews.apache.org>
X-ReviewRequest-URL: https://reviews.apache.org/r/15650/


> On Nov. 20, 2013, 4:16 p.m., kturner wrote:
> > test/system/continuous/hdfs-agitator.pl, line 104
> >
> > What are the pros and cons of using this haadmin command vs killing namenode processes?
>
> Sean Busbey wrote:
>     Pro haadmin:
>
>     * The underlying HDFS instance may not be configured for automatic failover.
>     * The haadmin command doesn't require knowing where the NameNode processes are running within the cluster.
>     * The haadmin tool is a publicly exposed way of saying "do a failover", whereas finding the NameNode to kill would be a heuristic.
>
>     Pro killing the namenode:
>
>     * If you specifically need to test what happens when it's the automatic failover process kicking in.
>
>     Note that I don't think the pro-killing point is that strong. The haadmin command still needs to transition the active to standby and then the standby to active, so systems above HDFS are going to already encounter e.g. gaps in there being an active namenode.
>
> kturner wrote:
>     I made the following comment on the dev list earlier because Review Board was not working. I suspect killing the processes would yield slightly more realistic test results, but it certainly makes our scripts more unwieldy. Maybe a better way to do this is to work towards moving hdfs agitation into hdfs itself.
>
>     Taking things a bit further, killing processes is not as effective in test as really killing machines (because it does not expose issues like unflushed data in OS caches).
>
>     On to another issue. Does the script ever kill all HA namenodes? Is this possible w/ haadmin?
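For readers unfamiliar with the haadmin approach being weighed above, the sketch below illustrates roughly what a graceful haadmin-driven failover looks like. It is not the hdfs-agitator.pl under review; the service IDs nn1 and nn2 are placeholders for whatever dfs.ha.namenodes.<nameservice> defines on a given cluster.

    #!/usr/bin/env perl
    # Rough sketch of the haadmin-driven failover discussed above; not the
    # hdfs-agitator.pl from this review. nn1/nn2 are assumed service IDs.
    use strict;
    use warnings;

    my @namenodes = ('nn1', 'nn2');

    # Ask HDFS for each NameNode's HA state ("active" or "standby").
    my %state;
    for my $nn (@namenodes) {
        my $out = `hdfs haadmin -getServiceState $nn 2>/dev/null`;
        chomp $out;
        $state{$nn} = $out;
    }

    my ($active)  = grep { $state{$_} eq 'active' }  @namenodes;
    my ($standby) = grep { $state{$_} eq 'standby' } @namenodes;
    die "could not identify an active/standby pair\n"
        unless defined $active && defined $standby;

    # Graceful failover: haadmin demotes the active and promotes the standby,
    # without needing to know which hosts the NameNode processes run on.
    system('hdfs', 'haadmin', '-failover', $active, $standby) == 0
        or die "hdfs haadmin -failover $active $standby failed\n";
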
> Sean Busbey wrote:
>     The kind of testing you're talking about generally happens in BigTop, rather than in individual components (e.g. HBase).
>
>     Haadmin doesn't have a command to take NameNodes offline, just to mark them as standby rather than active. I believe you could use haadmin to force all namenodes into standby mode, but I would suspect that in a setup with automatic failover the failover controllers would cause one to become active again. I'll check to confirm this.
>
>     Actually getting to the point of killing machines requires something external, e.g. the ability to talk to power managers or VMs. If we're looking for that level of fault testing, then I think we're better off deferring to BigTop and trying to improve both Accumulo's presence there and the use of e.g. Chaos Monkey or Gremlins.

I agree killing machines is out of scope. I brought it up to argue against killing processes, but did not complete the thought. Regardless of how it's done, it would be nice to test Accumulo when there are temporarily no datanodes and/or no namenode. That may be a separate issue if it's more features than the scripts currently have.

- kturner


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29167
-----------------------------------------------------------


On Nov. 18, 2013, 5:13 p.m., Sean Busbey wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15650/
> -----------------------------------------------------------
>
> (Updated Nov. 18, 2013, 5:13 p.m.)
>
>
> Review request for accumulo and Alex Moundalexis.
>
>
> Bugs: ACCUMULO-1794
>     https://issues.apache.org/jira/browse/ACCUMULO-1794
>
>
> Repository: accumulo
>
>
> Description
> -------
>
> ACCUMULO-1794 adds hdfs failover to continuous integration test.
>
>
> Diffs
> -----
>
>   test/system/continuous/continuous-env.sh.example 830ae86b5bf2398a840b853423755f6dd65f2dc0
>   test/system/continuous/hdfs-agitator.pl PRE-CREATION
>   test/system/continuous/start-agitator.sh 52e5a4e82a4564fa624a71f73ad29fa20ba23246
>   test/system/continuous/stop-agitator.sh b853a55b12f8402606af52e0748ca50daf95ed7f
>
> Diff: https://reviews.apache.org/r/15650/diff/
>
>
> Testing
> -------
>
> Ran the hdfs agitator on a CDH4 cluster configured for HA. It successfully caused the active namenode to fail over as it went.
>
>
> Thanks,
>
> Sean Busbey
>
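For contrast with the graceful haadmin failover sketched earlier, the rough fragment below shows what the process-killing alternative discussed in the thread could look like. The host name, the use of jps over ssh, and the hadoop-daemon.sh path are all assumptions about a typical Hadoop 2 layout, not something from the patch under review.

    #!/usr/bin/env perl
    # Rough sketch (not part of the reviewed patch) of process-level NameNode
    # agitation: locate the NameNode JVM on an assumed host, kill it, wait,
    # and restart it.
    use strict;
    use warnings;

    my $nn_host     = 'namenode-1.example.com';               # assumed NameNode host
    my $hadoop_home = $ENV{HADOOP_HOME} || '/usr/lib/hadoop'; # assumed install path
    my $down_secs   = 120;                                     # how long to leave it down

    # jps prints lines like "12345 NameNode"; pick out the NameNode pid.
    my @jps = `ssh $nn_host jps`;
    my ($pid_line) = grep { /\bNameNode\s*$/ } @jps;
    die "no NameNode process found on $nn_host\n" unless defined $pid_line;
    my ($pid) = $pid_line =~ /^(\d+)/;

    # Kill it hard, as a crash would, then bring it back after a pause.
    system('ssh', $nn_host, 'kill', '-9', $pid) == 0
        or die "failed to kill NameNode (pid $pid) on $nn_host\n";
    sleep $down_secs;
    system('ssh', $nn_host, "$hadoop_home/sbin/hadoop-daemon.sh start namenode") == 0
        or die "failed to restart NameNode on $nn_host\n";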