hadoop-common-user mailing list archives

From Amit Chandel <amitchan...@gmail.com>
Subject using HDFS for a distributed storage system
Date Mon, 09 Feb 2009 04:06:23 GMT
Hi Group,

I am planning to use HDFS as a reliable and distributed file system for
batch operations. No plans as of now to run any map reduce job on top of it,
but in future we will be having map reduce operations on files in HDFS.

The current (test) system has 3 machines:
NameNode: dual core CPU, 2GB RAM, 500GB HDD
2 DataNodes: Both of them with a dual core CPU, 2GB of RAM and 1.5TB of
space with ext3 filesystem.

I just need to put and retrieve files from this system. The files I
will put in HDFS vary from a few bytes to around 100MB, with the average
file size being 5MB, and the number of files will grow to around 20-50
million. To avoid hitting the limit on the number of files under a single
directory, I store each file at a path derived from the SHA1 hash of its
content (which is 20 bytes long; I create a 10-level deep path using 2 bytes
for each level). When I started the cluster a month back, I had set the
block size to 1MB.
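To make the layout concrete, here is a small sketch of the path scheme
described above (the function name is mine, not from any Hadoop API): the
40-hex-character SHA-1 digest is split into 10 components of 4 hex
characters (2 bytes) each.

```python
import hashlib

def sha1_path(content: bytes, levels: int = 10) -> str:
    """Derive a directory path from the SHA-1 of the file content.

    The 40-hex-char digest is split into `levels` equal components
    (4 hex chars, i.e. 2 bytes, per level in the 10-level scheme).
    """
    digest = hashlib.sha1(content).hexdigest()  # 40 hex chars
    step = len(digest) // levels
    parts = [digest[i:i + step] for i in range(0, step * levels, step)]
    return "/" + "/".join(parts)

# sha1_path(b"hello") -> /aaf4/c61d/dcc5/e8a2/dabe/de0f/3b48/2cd9/aea9/434d
```

Note that every file creates up to 10 fresh directory entries in the
namespace, which matters for the object counts discussed below.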

The hardware specs mentioned at
http://wiki.apache.org/hadoop/MachineScaling assume running map
reduce operations, so I am not sure if my setup is good
enough. I would like to get input on this setup.
The final cluster would have each datanode with 8GB RAM, a quad core CPU,
and 25 TB attached storage.

I played with this setup a little and then planned to increase the disk
space on both DataNodes. I started by increasing the disk capacity of the
first DataNode to 15TB and changing the underlying filesystem to XFS (which
made it a clean DataNode), and put it back in the system. Before performing
this operation, I had inserted around 70000 files into HDFS; the NameNode
web UI (dfshealth.jsp) showed *677323 files and directories, 332419 blocks
= 1009742 total*. I guess the way I create a 10-level deep path for each
file results in ~10 times the number of actual files in HDFS. Please
correct me if I am wrong. I then ran the rebalancer on the cleaned-up
DataNode, which was too slow to begin with (writing 2 blocks per second,
i.e. 2MBps) and died after a few hours saying "too many open files". I
checked all the machines, and the DataNode and NameNode processes were
running fine on all of them, but dfshealth.jsp showed both DataNodes as
dead. Restarting the cluster brought both of them up. I guess this has to
do with RAM requirements. My question is: how do I figure out the RAM
requirements of the DataNode and NameNode in this situation? The
documentation states that both the DataNode and NameNode store the block
index, but it is not clear whether the whole index is kept in memory. Once
I have figured that out, how can I instruct Hadoop to rebalance with high
priority?
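For a back-of-envelope estimate: a commonly cited rule of thumb (not from
this thread) is that each file, directory, and block consumes roughly 150
bytes of NameNode heap, since the whole namespace is kept in memory. Under
that assumption, the numbers above work out as:

```python
# ~150 bytes per namespace object is a widely quoted rule of thumb
# for the NameNode, not an exact figure.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(files_and_dirs: int, blocks: int) -> int:
    """Rough NameNode heap needed for a given namespace size."""
    return (files_and_dirs + blocks) * BYTES_PER_OBJECT

# Current namespace: 677,323 files+dirs, 332,419 blocks = 1,009,742 objects
current = namenode_heap_bytes(677_323, 332_419)
print(current / 2**20)  # ~144 MB: comfortable in 2 GB of RAM

# Projected: 50M files, ~9 extra directories each (10-level paths),
# and 5 MB files / 1 MB blocks -> ~5 blocks per file.
projected = namenode_heap_bytes(50_000_000 + 450_000_000, 250_000_000)
print(projected / 2**30)  # ~105 GB: far beyond any single-node heap here
```

If that rule of thumb is even roughly right, the 10-level paths and 1MB
blocks dominate the NameNode memory bill, not the files themselves.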

Another question is regarding the "Non DFS used:" statistic shown on
dfshealth.jsp: is it the space used to store file and directory
metadata (apart from the actual file content blocks)? Right now
it is 1/4th of the total space used by HDFS.

Some points which I have thought of over the last month to improve this
model are:
1. I should keep very small files (say, smaller than 1KB) out of HDFS.
2. Reduce the directory depth of the SHA1-derived file path (from 10
levels to 3).
3. I should increase the block size to reduce the number of blocks in HDFS
(the earlier thread <4aa34eb70805180030u5de8efaam6f1e9a8832636d42@mail.gmail.com>
says it won't result in wasted disk space).
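A rough sketch of what points 2 and 3 buy, assuming the 5MB average file
size and 50M files stated above (the approximation that deeper path levels
are effectively unique per file is mine):

```python
import math

AVG_FILE_MB = 5
FILES = 50_000_000

def blocks_per_file(block_size_mb: int) -> int:
    """Blocks needed for one average-size file."""
    return math.ceil(AVG_FILE_MB / block_size_mb)

print(blocks_per_file(1))   # 5 blocks per file at a 1 MB block size
print(blocks_per_file(64))  # 1 block per file at the 64 MB default

# With 2 bytes (4 hex chars) per level, the first 16-bit level saturates
# at 65,536 directories, while levels below it are nearly unique per file
# (50M files vs 2^32+ slots), i.e. roughly one new directory per file
# per level beyond the first.
def approx_dirs(levels: int) -> int:
    return 65_536 + (levels - 1) * FILES

print(approx_dirs(10))  # ~450M directories for the 10-level scheme
print(approx_dirs(3))   # ~100M directories for a 3-level scheme
```

Together these cut the projected namespace from roughly 750M objects
(files + directories + blocks) to around 200M, a big saving on whatever
the NameNode holds in memory.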

Any further suggestions for improvement are appreciated.

