Message-ID: <44B76855.7040406@apache.org>
Date: Fri, 14 Jul 2006 12:48:05 +0300
From: Doug Cutting
To: hadoop-user@lucene.apache.org
Subject: Re: DFS question: does append-only means faster updates ?
References: <20060713062302.84781.qmail@web34306.mail.mud.yahoo.com>
In-Reply-To: <20060713062302.84781.qmail@web34306.mail.mud.yahoo.com>

drwho wrote:
> I've always wondered if a lack of overwrite / random-write op means
> that updates are much faster than conventional filesystems.

Not really. Since DFS is implemented on top of the ordinary filesystem, it's never any faster at serial access. What it adds is scalability (petabytes in a single namespace), reliability (continuous access to data through disk and host failures), and distributed performance (1000 hosts reading or writing in parallel to the same logical FS).

> The fact that both (dfs, gfs) support delete op, does it mean that
> fragmentation will still be a big problem ?

Fragmentation should not be a problem, since files are chunked into 128MB blocks stored in local filesystems.

> Also, would the lack of overwrite / random-write ops mean that the
> filesystem is less suitable for apps like online word-processor or
> even online spreadsheet / database ?

Yes, such applications are probably not appropriate for direct implementation on top of DFS. It would work, but it would not be the best utilization of resources. Google uses BigTable, layered on top of GFS, to store small items that may be independently updated. Hadoop may someday incorporate something like BigTable. Mike Cafarella has discussed this a bit on the hadoop-dev list:

http://www.mail-archive.com/hadoop-dev@lucene.apache.org/msg01415.html
http://www.mail-archive.com/hadoop-dev@lucene.apache.org/msg01443.html

Doug
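
A minimal sketch of the block arithmetic behind the fragmentation answer above, assuming the 128MB block size quoted in the message. The function name split_into_blocks and the 300MB example file are hypothetical, chosen only for illustration; this is not how DFS itself is implemented.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128MB, the block size quoted in the message


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block, for a file of file_size bytes.

    Hypothetical helper: it only models how a file divides into
    fixed-size blocks, with a shorter final block if needed.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    # Hypothetical 300MB file: two full 128MB blocks plus a 44MB tail block.
    for offset, length in split_into_blocks(300 * 1024 * 1024):
        print(offset, length)
```

Under this model, deleting a file frees whole blocks of up to 128MB each rather than many small scattered extents, which is presumably why fragmentation is not expected to be a concern.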