From: Eric Baldeschwieler
Subject: Re: Hadoop Distributed File System requirements on Wiki
Date: Sat, 8 Jul 2006 23:34:03 -0700
To: hadoop-dev@lucene.apache.org

I think we need to have task(s) on the list detailing upgrades. We also
need a process around releasing filesystem changes that change durable
data. And testing...

On Jul 7, 2006, at 12:32 PM, Konstantin Shvachko wrote:

> Paul Sutter wrote:
>
>> On 7/7/06, Konstantin Shvachko wrote:
>>
>>> > *Recoverability and Availability Goals*
>>> >
>>> > You might want to consider adding recoverability and availability
>>> > goals.
>>>
>>> This is an interesting observation. Ideally, we would like to save
>>> and replicate the fs image file as soon as the edits file reaches a
>>> specific size, and we would like to make edits file updates
>>> transactional, with the file system locked for updates during the
>>> transaction. This would be the zero recoverability goal in your
>>> terms. Are we willing to weaken this requirement in favor of
>>> performance?
>>
>> Actually it's OK for me if we lose even an hour of data on a namenode
>> crash, since I can just resubmit the recent jobs. Less loss is better,
>> but my suggestion would be to favor simplicity over absolute recovery
>> if that's a tradeoff. Others might feel differently about acceptable
>> levels of data loss.
>
> I agree, simplicity is also very important.
>
>>> > Availability goals are probably less stringent than for most
>>> > storage systems (dare I say that a few hours' downtime is probably
>>> > OK). Adding these goals to the document could be valuable for
>>> > consensus and prioritization.
>>>
>>> If I understood you correctly, this goal is more related to a
>>> specific installation of the system rather than to the system itself
>>> as a software product. Or do you mean that the total time spent by
>>> the system on self-maintenance procedures like backups and
>>> checkpointing should not exceed 2 hours a day? In any case, I agree,
>>> high availability should be mentioned, probably in the "Feature
>>> requirements" section.
>>
>> It's about features. Is namenode failover automatic or manual? If it's
>> manual, it takes time. And it should definitely be manual for now.
>> Seamless namenode failover done right is a lot of work, and
>> unnecessary.
>>
>> With manual failover, what is the downtime when a namenode fails?
>> Well, I imagine that you'd want to take everything down, bring the
>> filesystem up in safe mode (nice feature!) on the new namenode, and do
>> some kind of fscheck. And then, when you're comfortable that
>> everything is copacetic, all your files are present, and that the
>> filesystem won't do a radical dereplication of every block when you
>> make it writable, you make it writable. (In fact, the secondary
>> namenode might always come up in safe mode until manually changed.)
>>
>> How long does this take? Well, during this time the system is
>> unavailable. And if it fails at 2AM, you're probably not back up
>> before 10AM.
>>
>> But that's OK. Better to be down for a few hours (manual failover)
>> than to have a complex system likely to break (seamless automatic
>> failover).
>
> That's a good point. We should probably add a task to define/describe
> manual failover procedures and to evaluate the availability goal that
> we can reasonably guarantee.
>
>>> >> > *Backup Scheme*
>>> >> >
>>> >> > We might want to start discussion of a backup scheme for HDFS,
>>> >> > especially given all the courageous rewriting and
>>> >> > feature-addition likely to occur.
>>> >>...
>>> >
>>> > But as for covering my fears, I'll feel safer with key data backed
>>> > up in a filesystem that is not DFS, as pedestrian as that sounds. :)
>>>
>>> Frankly speaking I've never thought about a backup of a 10 PB storage
>>> system. How much space will that require? Isn't it easier just to
>>> increase the replication factor? Just a thought...
>>
>> Increasing replication doesn't protect me against a filesystem bug.
>>
>> I'm a nervous nelly on this one: file system revisions do scare me,
>> and I don't have a 10PB system. Let's say I have a 100TB system, and
>> that to get back into production I need only restore 5TB worth of
>> critical files. Then once I'm back in production I can gradually
>> restore the next 25TB and regenerate the rest.
>>
>> It's feasible and probably prudent. It's not that I'm expecting data
>> loss bugs in new code. My concern is less about the likelihood of the
>> problem, and more about the severity of the problem.
>>
>> To back up a 10PB system, you would want to back it up to a second
>> 10PB system located on an opposite coast. In fact if this system is
>> important to your business, you must do this. And then there is the
>> question, do you stagger software updates on these two systems?
>> Probably.
>>
>> You might want to find someone from EMC or Netapp, and get their
>> feedback on how software changes, QA, and beta testing are handled
>> (including timelines). Storage systems are really a risky type of code
>> to modify, for lots of reasons more apparent to the downstream
>> consumers than to developers. :)
>
> I guess if we want to separate the backup from the original storage
> on the hardware level we have two options:
> a) mirror data to another dfs cluster (earlier version, opposite coast)
> b) copy critical data to a different (local) fs
> If only 5% of the whole data set is critical you might want to go
> with (b).
> This can be a separate (dfs based) application or an extension to dfs.
> If ~100% is critical then (a) is the only way.
> On a related issue, do we want to add the upgrade procedures task
> to the list?
>
> Thanks,
> Konstantin