From: "Paul Sutter"
To: hadoop-dev@lucene.apache.org
Date: Thu, 6 Jul 2006 13:56:01 -0700
Subject: Re: Hadoop Distributed File System requirements on Wiki
In-Reply-To: <9DC52EE0-B405-4B34-A420-7A7BBF9EF548@yahoo-inc.com>
References: <44A5C920.5050808@yahoo-inc.com> <9DC52EE0-B405-4B34-A420-7A7BBF9EF548@yahoo-inc.com>

Eric,

Thanks - response embedded below.

One more suggestion: store a copy of the per-block metadata on the
datanode. It doesn't have to have an updated copy of the filename; just
the "original file name" and block offset would be fine. Since you're
adding truncation features, you'd want some kind of truncation
generation number too. This would make possible a distributed namenode
recovery, which is belt-and-suspenders valuable even after adding
checkpointing features to the namenode. Storing this metadata is more
important than writing the recovery program, since the recovery program
could be written after the disaster that makes it necessary. (Just a
suggestion.)

On 7/6/06, Eric Baldeschwieler wrote:
>
> On Jul 6, 2006, at 12:02 PM, Paul Sutter wrote:
> > ...
> > *Constant size file blocks (#16), -1*
> >
> > I vote to keep variable size blocks, especially because you are adding
> > atomic append capabilities (#25). Variable length blocks creates the
> > possibility for blocks that contain only whole records. This:
> > - improves recoverability for large important files with one or more
> > irrevocably lost blocks, and
> > - makes it very clean for mappers to process local data blocks
> > ...
>
> I think we can achieve our goal without compromising yours.
> Each block can be of any size up to the files fixed block size. The
> system can be aware of that and provide an API to report gaps and/or
> an API option to skip them or see them as NULLs. This reporting can
> be done at the datanode level allowing us to remove all the size data
> & logic at the namenode level.
>
> ** If you agree, why don't we just add the above annotation to
> konstantine's doc?

Wow! Good idea, and now I see why you wanted to make the change in the
first place. I agree, please go ahead and add.

Incidentally, it's probably fine if
- the API just skipped the ghost bytes,
- programs using such files should only ever seek to locations that had
  been returned by getPos(), and
- getPos() should return the byte offset of the next block as soon as a
  ghost byte is reached.

I think existing programs will work fine within these restrictions. The
last one is intended for code like SequenceFile that checks current
position against file length when reading data. (SequenceFile's syncing
code might have to get reconsidered, but would be easier since you'd
just advance to the next block on a checksum failure.)

> > *Recoverability and Availability Goals*
> ...
> > **
> > *Backup Scheme*
> > **
> > We might want to start discussion of a backup scheme for HDFS,
> > especially given all the courageous rewriting and feature-addition
> > likely to occur.
>
> ** I agree, this needs to be on the list. I'm imagining a command
> that hardlinks every datanode's (and namenode's if needed) files into
> a snapshot directory. And another command that moves all current
> state into a snapshot directory and hardlinks a snapshot's state back
> into the working directory. This would be very fast and not cost
> much space in the short term. Thoughts? (yes, hardlinks are a pain
> on the PC, we can discuss design later)

This is a fantastic idea. But as for covering my fears, I'll feel safer
with key data backed up in a filesystem that is not DFS, as pedestrian
as that sounds.
:)

> > *Rebalancing (#22,#21)*
> >
> > I would suggest that keeping disk usage balanced is more than a
> > performance feature, its important for the success of running jobs
> > with large map outputs or large sorts. Our most common reducer
> > failure is running out of disk space during sort, and this is
> > caused by imbalanced block allocation.
>
> ** Good point. Any interest in helping us with this one?

We'll take a look at it.

> >
> > On 6/30/06, Konstantin Shvachko wrote:
> >>
> >> I've created a Wiki page that summarizes DFS requirements and
> >> proposed changes.
> >> This is a summary of discussions held in this mailing list and
> >> additional internal discussions.
> >> The page is here:
> >>
> >> http://wiki.apache.org/lucene-hadoop/DFS_requirements
> >>
> >> I see there is an ongoing related discussion in HADOOP-337.
> >> We prioritized our goals as
> >> (1) Reliability (which includes Recoverability and Availability)
> >> (2) Scalability
> >> (3) Functionality
> >> (4) Performance
> >> (5) other
> >> But then gave higher priority to some features like the append
> >> functionality.
> >>
> >> Happy holidays to everybody.
> >>
> >> --Konstantin Shvachko
> >>
> >
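
Illustration of the "ghost byte" restrictions discussed in this message
(skip ghost bytes on read, seek only to offsets returned by getPos(),
and have getPos() jump to the next block as soon as a ghost byte is
reached). This is only a rough sketch in Python, not Hadoop code: the
class GhostSkippingReader, its methods, and the in-memory list of
blocks are all hypothetical names invented for the example. It assumes
each block occupies a fixed-size slot in the file's address space but
may hold fewer real bytes than the slot size.

```python
class GhostSkippingReader:
    """Hypothetical reader over blocks that may be shorter than block_size.

    Offsets live in a fixed-slot address space: block i starts at
    i * block_size. Bytes between the end of a block's real data and the
    start of the next slot are "ghost bytes" and are never returned.
    """

    def __init__(self, blocks, block_size):
        for b in blocks:
            if len(b) > block_size:
                raise ValueError("block larger than the file's block size")
        self.blocks = blocks
        self.block_size = block_size
        self.pos = 0
        self._skip_ghosts()

    def length(self):
        # File length in the fixed-slot address space (assumption of this
        # sketch), so that get_pos() == length() exactly at end of file.
        return len(self.blocks) * self.block_size

    def _skip_ghosts(self):
        # If pos sits on a ghost byte, advance it to the next block start.
        while True:
            idx, off = divmod(self.pos, self.block_size)
            if idx >= len(self.blocks) or off < len(self.blocks[idx]):
                return
            self.pos = (idx + 1) * self.block_size

    def get_pos(self):
        # Always a real byte offset or end of file, never a ghost byte.
        return self.pos

    def seek(self, pos):
        # Callers are expected to pass only values returned by get_pos().
        if not 0 <= pos <= self.length():
            raise ValueError("seek outside file")
        self.pos = pos
        self._skip_ghosts()

    def read(self, n):
        # Return up to n real bytes, silently stepping over ghost bytes.
        out = bytearray()
        while n > 0:
            idx, off = divmod(self.pos, self.block_size)
            if idx >= len(self.blocks):
                break  # end of file
            chunk = self.blocks[idx][off:off + n]
            out += chunk
            n -= len(chunk)
            self.pos += len(chunk)
            self._skip_ghosts()
        return bytes(out)
```

With this behaviour, a loop of the form "while reader.get_pos() <
reader.length(): ..." terminates cleanly, which is the kind of
position-versus-length check the message attributes to SequenceFile.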