From: Ted Dunning
Reply-To: hadoop-user@lucene.apache.org
Date: Mon, 10 Dec 2007 13:17:17 -0800
Subject: Re: HDFS tool and replication questions...

More to the specific point: yes, all 100 nodes will wind up storing data for
large files, because blocks should be assigned pretty much at random. The
exception is files that originate on a datanode; there, the local node gets
one copy of each block. Replica blocks still follow the random rule, however,
so you wind up in much the same place in the end.


On 12/10/07 1:10 PM, "dhruba Borthakur" wrote:

> The replication factor should be such that it can provide some level of
> availability and performance. HDFS attempts to distribute replicas of a
> block so that they reside across multiple racks. HDFS block replication
> is *purely* block-based and file-agnostic; i.e., blocks belonging to the
> same file are handled precisely the same way as blocks belonging to
> different files.
>
> Hope this helps,
> dhruba
>
> Also, are there any metrics or best practices around what the
> replication factor should be based on the number of nodes in the grid?
> Does HDFS attempt to involve all nodes in the grid in replication? In
> other words, if I have 100 nodes in my grid and a replication factor of
> 6, will all 100 nodes wind up storing data for a given file, assuming
> the file is large enough?
>
> Thanks,
> C G
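
For the replication-factor side of the question, a minimal sketch of raising a
single file's replication through the Java FileSystem API might look like the
code below. The namenode URI and file path are made up purely for
illustration; only FileSystem.setReplication() itself is the real API call.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
      public static void main(String[] args) throws Exception {
        // Hypothetical namenode URI; substitute your cluster's fs.default.name.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, just for illustration.
        Path file = new Path("/user/cg/big-input.dat");

        // Ask the namenode for 6 replicas of every block in this file.
        // The namenode chooses which datanodes hold the copies; the client
        // has no control over placement, which is why the blocks end up
        // spread across the grid as described above.
        boolean ok = fs.setReplication(file, (short) 6);
        System.out.println("setReplication accepted: " + ok);

        fs.close();
      }
    }

The same request can be made from the shell with something like
hadoop dfs -setrep 6 /user/cg/big-input.dat. Either way, placement is decided
by the namenode, not the client.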