hadoop-common-user mailing list archives

From Chris Smith <csmithx+had...@gmail.com>
Subject RE: Question about disk space allocation in hadoop
Date Wed, 30 Jun 2010 14:28:55 GMT
Some thoughts on how to restrict the temporary data, but I have only
tried (a) in anger:

a)    Partition your disks into HDFS and intermediate temp partitions
of the relevant size.  This gives a fixed separation, but it is
difficult or impossible to modify on a busy cluster, especially as
there may be no way of unloading/recovering the data stored in HDFS
if you make a mistake resizing partitions;

b)      Implement disk quotas and set hard and soft limits on the
root directories used for intermediate space. This gives you the
flexibility to change the limits when required, but as the limits
are per user/group, some thought is needed as to which user/group
they apply to. There may also be a performance impact.

You could combine this with setting “dfs.datanode.du.reserved” in
$HADOOP_HOME/conf/hdfs-site.xml. It reserves a number of bytes per
volume for non-HDFS use, which limits HDFS disk usage.

c)      Implement intermediate data space as a loopback file, see:
http://wiki.cita.utoronto.ca/mediawiki/index.php/Fake_Fast_Local_Disk
This example implements a temporary loopback filesystem on an iSCSI
mounted Lustre filesystem but the principles are the same. There are
some performance benchmarks linked to in section 3. The intermediate
temp data space is limited by the size of the loopback file created.
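
As a rough illustration of the sizing arithmetic behind the options
above, here is a minimal Python sketch. It takes the 25% figure from
the original question and an assumed 1 TB disk (the capacity and the
helper names reserved_bytes and du_reserved_property are hypothetical,
chosen only for this example). The result can serve as the size of a
loopback file or temp partition, and as a starting value for
“dfs.datanode.du.reserved”, which is given in bytes per volume.

```python
# Sketch only: disk capacity and the 25% share are assumed figures.

def reserved_bytes(disk_capacity_bytes: int, temp_fraction: float) -> int:
    """Bytes per volume to keep out of HDFS for intermediate data."""
    if not 0.0 <= temp_fraction < 1.0:
        raise ValueError("temp_fraction must be in [0, 1)")
    return int(disk_capacity_bytes * temp_fraction)


def du_reserved_property(value_bytes: int) -> str:
    """Render an hdfs-site.xml property for dfs.datanode.du.reserved."""
    return (
        "<property>\n"
        "  <name>dfs.datanode.du.reserved</name>\n"
        f"  <value>{value_bytes}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    capacity = 1000 * 10**9  # assumed 1 TB (decimal) disk
    print(du_reserved_property(reserved_bytes(capacity, 0.25)))
```

Note that the reserved value only keeps HDFS out of that space. It
does not stop intermediate data from growing beyond it, so a hard cap
on temp data still needs one of (a), (b) or (c).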

Chris

-----Original Message-----
From: Yu Li [mailto:carp84@gmail.com]
Sent: 30 June 2010 04:11
To: common-user@hadoop.apache.org
Subject: Re: Question about disk space allocation in hadoop

Hi all,

Does anybody have experience with this? Any comments/suggestions
would be highly appreciated. Thanks.

Best Regards,
Carp

2010/6/29 Yu Li <carp84@gmail.com>:
> Hi all,
>
> As we all know, machines in a Hadoop cluster may be both datanode and
> tasktracker, so one machine may store both MR job intermediate data
> and HDFS data. My question is: if we have more than one disk per node,
> say 4 disks, and would like both job intermediate data and HDFS data
> stored on all disks to reduce the I/O load on each single disk, can we
> draw a line between the space for the local FS and HDFS? For example,
> restrict the intermediate temp data to no more than 25% of the space
> on each disk? Thanks in advance.
>
> Best Regards,
> Carp
>
