From: Todd Lipcon <todd@cloudera.com>
Date: Mon, 15 Oct 2012 18:24:35 -0700
Subject: Re: Question about namenode HA
To: 谢良 (Xie Liang) <xieliang@xiaomi.com>
Cc: user@hadoop.apache.org

Hi Liang,

Answers inline below.

On Sun, Oct 14, 2012 at 8:01 PM, 谢良 <xieliang@xiaomi.com> wrote:

> Hi Todd and other HA experts,
>
> I have two questions:
>
> 1) Why is the ZKFC a separate process? I mean, what was the primary design
> consideration behind not integrating the ZKFC features into the NameNode
> itself?

There are a few reasons for this design choice:

1) Like Steve said, it's easier to monitor a process from another process than to self-monitor. Consider, for example, what happens if the NN somehow gets into a deadlock. The process may still be alive, and a ZooKeeper thread would keep running, even though the NN is not successfully handling any operations. The ZKFC, running in a separate process, periodically pings the local NN via RPC to ensure that the RPC server is still working properly, not deadlocked, etc.

2) If the NN process itself crashes (e.g. a segfault due to bad RAM), the ZKFC will notice quite quickly and delete its own ZooKeeper node. If the NN were holding its own ZK session, you would have to wait for the full ZK session timeout to expire. So the external ZKFC results in a faster failover for certain classes of failure.
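Not from the original mail, but the first point can be illustrated with a small, hypothetical Python sketch: a toy service whose handler thread gets wedged while its in-process heartbeat thread keeps ticking. Only an external RPC-style round trip through the handler (the ZKFC approach) detects the hang; the in-process "session" still looks alive.

```python
import queue
import threading
import time

class Service:
    """Toy stand-in for the NN: one handler thread serving a request queue."""

    def __init__(self):
        self.requests = queue.Queue()
        self.heartbeats = 0
        threading.Thread(target=self._handler, daemon=True).start()
        threading.Thread(target=self._heartbeat, daemon=True).start()

    def _handler(self):
        while True:
            req, reply = self.requests.get()
            if req == "hang":
                threading.Event().wait()  # simulate a deadlocked handler: blocks forever
            reply.put("ok")

    def _heartbeat(self):
        # An in-process ZK "session" would keep ticking like this even
        # while the handler above is wedged.
        while True:
            self.heartbeats += 1
            time.sleep(0.05)

    def rpc_ping(self, timeout):
        """External health check: a real round trip through the handler."""
        reply = queue.Queue()
        self.requests.put(("ping", reply))
        try:
            reply.get(timeout=timeout)
            return True
        except queue.Empty:
            return False

svc = Service()
assert svc.rpc_ping(timeout=1.0)            # healthy: the handler answers
svc.requests.put(("hang", queue.Queue()))   # wedge the handler
time.sleep(0.2)
before = svc.heartbeats
ping_ok = svc.rpc_ping(timeout=0.5)
print(ping_ok)                  # False: the external ping sees the hang
print(svc.heartbeats > before)  # True: self-monitoring still looks "alive"
```

This is only an analogy for the design argument, not how the real ZKFC is implemented: the point is that liveness of a background thread proves nothing about whether the service is actually handling requests.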
> 2) If I deploy CDH4.1 (which includes the QJM feature), since QJM can fence
> writers, can I safely just configure it like this?
>
>     <name>dfs.ha.fencing.methods</name>
>     <value>shell(/bin/true)</value>

Yes, this is safe. The design of the QuorumJournalManager ensures that multiple conflicting writers cannot corrupt your namespace in any way. You might still consider using sshfence ahead of that, with a short configured timeout -- this provides "read fencing". Otherwise the old NN could theoretically serve stale reads for a few seconds before it noticed that it had lost its ZK lease. But it's definitely not critical -- the old NN will eventually attempt some kind of write and abort itself. So I'd recommend /bin/true as the last configured method in your fencing list with QJM.

Thanks
-Todd

--
Todd Lipcon
Software Engineer, Cloudera
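Not part of the original thread, but as a concrete sketch of Todd's recommendation, a hypothetical hdfs-site.xml fragment: sshfence tried first with a short connect timeout for read fencing, and shell(/bin/true) last so failover still succeeds when the old NN's host is unreachable. The private-key path and the 5000 ms timeout are illustrative assumptions, not values from the mail.

```xml
<!-- Hypothetical hdfs-site.xml fragment; key path and timeout are illustrative. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Methods are tried in order, one per line: sshfence for fast "read
       fencing", then shell(/bin/true) so failover succeeds even if the old
       NN's host is down (QJM already prevents conflicting writers). -->
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
<property>
  <!-- Short SSH connect timeout (ms) so a dead host doesn't stall failover. -->
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>5000</value>
</property>
```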
