Date: Fri, 2 Oct 2009 12:41:57 +0200
Subject: Re: NameNode high availability
From: Stas Oskin <stas.oskin@gmail.com>
To: common-user@hadoop.apache.org

Hi.

The HA service (heartbeat) is running on Dom0, and when the primary node
goes down it simply starts the VM on the other node, so there shouldn't
be any timing issues.

Can you explain a bit more about your approach, for example how to
automate it?

Thanks.

On 10/2/09, Steve Loughran wrote:
> Stas Oskin wrote:
>> Hi.
>>
>>> Could you share the way in which it didn't quite work? It would be
>>> valuable information for the community.
>>
>> The idea is to have a Xen machine dedicated to the NN, and maybe the
>> SNN, running over DRBD, as described here:
>> http://www.drbd.org/users-guide/ch-xen.html
>>
>> The VM is monitored by heartbeat, which restarts it on another node
>> when it fails.
>>
>> I wanted to go that way as I thought it was perfect for a small
>> cluster, since the node can then be re-used for other tasks. Once the
>> cluster grows reasonably, the VM could be live-migrated to a dedicated
>> machine with minimal downtime.
>>
>> The problem is that it didn't work as expected. Xen over DRBD is just
>> not reliable, as described.
>> The most basic operation, live domain migration, works in only 50% of
>> cases. Most often the migration leaves DRBD in read-only status,
>> meaning the domain can't be cleanly shut down, only killed. This in
>> turn often leads to NN metadata corruption.
>
> It's probably a quirk of virtualisation; all those clocks and things
> cause trouble for any HA protocol running round the cluster. I would
> not blame Xen, as VMware and VirtualBox are also tricky.
>
> As you have a virtual infrastructure, why not have an image of the
> primary NN, ready to bring up on demand when the NN goes down, pointed
> at a copy of the NN datasets?

--
Sent from my mobile device
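[Archive note: Steve's suggestion of a standby image "pointed at a copy of the NN datasets" could be approximated in stock Hadoop of that era by listing an extra, shared directory in `dfs.name.dir`, which accepts a comma-separated list; the NameNode then writes its fsimage and edit log to every listed directory. A minimal sketch, where the paths (including the NFS mount point `/mnt/nfs/namenode`) are illustrative assumptions, not taken from the thread:]

```xml
<!-- hdfs-site.xml (sketch, hypothetical paths):
     the NameNode writes metadata to BOTH directories below, so a standby
     host mounting the same NFS export can be started against the
     surviving copy after the primary dies. -->
<property>
  <name>dfs.name.dir</name>
  <value>/var/hadoop/namenode,/mnt/nfs/namenode</value>
</property>
```

[On failover, the standby host would mount the NFS export, point its own `dfs.name.dir` at that copy, and start the NameNode; heartbeat could automate the mount-and-start step, though clients still have to reach the new NN, e.g. via a floating IP or a shared DNS name.]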