Message-ID: <45C8DEA3.5080201@apache.org>
Date: Tue, 06 Feb 2007 12:01:39 -0800
From: Doug Cutting
To: hadoop-user@lucene.apache.org
Subject: Re: Large data sets
References: <1bf79d3e0702021221j198e69bcl4f73e6ef723a742a@mail.gmail.com> <1bf79d3e0702051511ifae7564udb9f1bf0e95ff83e@mail.gmail.com> <45C8D4FB.60707@yahoo-inc.com>
In-Reply-To: <45C8D4FB.60707@yahoo-inc.com>

Konstantin Shvachko wrote:
> 200 bytes per file is theoretically correct, but rather optimistic :-(
> From real system memory utilization I can see that HDFS uses 1.5-2K
> per file.
> And since each real file is internally represented by two files (1 real
> + 1 crc) the real estimate per file should read 3-4K.

But also note that there are plans to address these issues over the coming months. For a start:

https://issues.apache.org/jira/browse/HADOOP-803
https://issues.apache.org/jira/browse/HADOOP-928

Once checksums are optional, we can replace their implementation in HDFS with something that does not consume namespace. Long term we hope to approach ~100 bytes per file.

Doug
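
A minimal back-of-envelope sketch of the sizing arithmetic discussed above, assuming the per-file namenode costs quoted in this thread (~200 bytes theoretical, 3-4K observed with the extra crc file). The class and constant names here are purely illustrative and are not part of Hadoop.

    // Rough namenode heap estimate from per-file memory cost.
    // Constants are the figures quoted in this thread, not measured
    // values for any particular Hadoop release.
    public class NamenodeHeapEstimate {

        static final long THEORETICAL_BYTES = 200L;   // ~200 bytes/file, theoretical
        static final long OBSERVED_LOW_BYTES = 3000L; // ~3K/file, observed (data + crc)
        static final long OBSERVED_HIGH_BYTES = 4000L;// ~4K/file, upper end observed

        static String gib(long bytes) {
            return String.format("%.1f GiB", bytes / (1024.0 * 1024 * 1024));
        }

        public static void main(String[] args) {
            long files = 10000000L; // example: 10 million files
            System.out.println("Files: " + files);
            System.out.println("Theoretical (~200 B/file): " + gib(files * THEORETICAL_BYTES));
            System.out.println("Observed low  (~3 KB/file): " + gib(files * OBSERVED_LOW_BYTES));
            System.out.println("Observed high (~4 KB/file): " + gib(files * OBSERVED_HIGH_BYTES));
        }
    }

For 10 million files this prints roughly 1.9 GiB at the theoretical cost versus about 28-37 GiB at the observed 3-4K per file, which is why reducing the per-file footprint (and removing the crc companion file from the namespace) matters for large data sets.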