Message-ID: <45C8DEA3.5080201@apache.org>
Date: Tue, 06 Feb 2007 12:01:39 -0800
From: Doug Cutting
To: hadoop-user@lucene.apache.org
Subject: Re: Large data sets
References: <1bf79d3e0702021221j198e69bcl4f73e6ef723a742a@mail.gmail.com> <1bf79d3e0702051511ifae7564udb9f1bf0e95ff83e@mail.gmail.com> <45C8D4FB.60707@yahoo-inc.com>
In-Reply-To: <45C8D4FB.60707@yahoo-inc.com>

Konstantin Shvachko wrote:
> 200 bytes per file is theoretically correct, but rather optimistic :-(
> From real system memory utilization I can see that HDFS uses 1.5-2K
> per file.
> And since each real file is internally represented by two files (1 real
> + 1 crc) the real estimate per file should read 3-4K.

But also note that there are plans to address these issues over the coming months. For a start:

https://issues.apache.org/jira/browse/HADOOP-803
https://issues.apache.org/jira/browse/HADOOP-928

Once checksums are optional, we can replace their implementation in HDFS with something that does not consume namespace. Long term we hope to approach ~100 bytes per file.

Doug
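
A minimal back-of-envelope sketch of the sizing arithmetic discussed above, assuming the per-file namenode costs quoted in this thread (~200 bytes theoretical, 3-4K observed with the extra crc file). The class and constant names here are purely illustrative and are not part of Hadoop.

    // Rough namenode heap estimate from per-file memory cost.
    // Constants are the figures quoted in this thread, not measured
    // values for any particular Hadoop release.
    public class NamenodeHeapEstimate {

        static final long THEORETICAL_BYTES = 200L;   // ~200 bytes/file, theoretical
        static final long OBSERVED_LOW_BYTES = 3000L; // ~3K/file, observed (data + crc)
        static final long OBSERVED_HIGH_BYTES = 4000L;// ~4K/file, upper end observed

        static String gib(long bytes) {
            return String.format("%.1f GiB", bytes / (1024.0 * 1024 * 1024));
        }

        public static void main(String[] args) {
            long files = 10000000L; // example: 10 million files
            System.out.println("Files: " + files);
            System.out.println("Theoretical (~200 B/file): " + gib(files * THEORETICAL_BYTES));
            System.out.println("Observed low  (~3 KB/file): " + gib(files * OBSERVED_LOW_BYTES));
            System.out.println("Observed high (~4 KB/file): " + gib(files * OBSERVED_HIGH_BYTES));
        }
    }

For 10 million files this prints roughly 1.9 GiB at the theoretical cost versus about 28-37 GiB at the observed 3-4K per file, which is why reducing the per-file footprint (and removing the crc companion file from the namespace) matters for large data sets.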