hadoop-common-user mailing list archives

From Azuryy Yu <azury...@gmail.com>
Subject Re: Datanode disk considerations
Date Wed, 08 Oct 2014 03:26:22 GMT
I think Brian gave the answer.

On Tue, Oct 7, 2014 at 9:13 PM, Brian C. Huffman <
bhuffman@etinternational.com> wrote:

> What about setting the dfs.datanode.fsdataset.volume.choosing.policy to
> org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy?
>
> Would that help?
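>
> For reference, a minimal hdfs-site.xml sketch (the threshold and fraction
> values below are the stock defaults, included only to show the knobs; tune
> them for your disk sizes):
>
>   <property>
>     <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
>     <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
>   </property>
>   <!-- volumes whose free space differs by less than this many bytes
>        are treated as balanced -->
>   <property>
>     <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
>     <value>10737418240</value>
>   </property>
>   <!-- fraction of new block writes sent to the volumes with more free space -->
>   <property>
>     <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
>     <value>0.75</value>
>   </property>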
>
> Regards,
> Brian
>
>
> On 08/06/2014 05:23 PM, Adam Faris wrote:
>
>> The Hadoop balancer doesn’t balance data across the local drives; it
>> balances data between datanodes on the grid, so running the balancer won’t
>> even out the disks within a single datanode.
>>
>> The datanode process round-robins between data directories on local disk,
>> so the smaller drives are expected to fill up first.  Typically people run
>> same-size drives within each compute node to prevent this from happening.
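>>
>> For example, with data directories configured like this in hdfs-site.xml
>> (paths are illustrative), new blocks land on each directory in turn no
>> matter how large the disk behind it is:
>>
>>   <property>
>>     <name>dfs.datanode.data.dir</name>
>>     <value>/data/disk1,/data/disk2,/data/disk3</value>
>>   </property>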
>>
>> You could partition the 2TB drive into four 500GB partitions.  This isn’t
>> optimal as you’ll have 4 write threads pointing at a single disk but is
>> fairly simple to implement.  Otherwise you’ll want to physically rebuild
>> your 4 nodes so each node has equal amounts of storage.
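>>
>> A rough sketch of that partitioning with parted (device name and layout
>> are examples only; verify against your own hardware first, as this is
>> destructive):
>>
>>   parted /dev/sdc mklabel gpt
>>   parted /dev/sdc mkpart primary 0% 25%
>>   parted /dev/sdc mkpart primary 25% 50%
>>   parted /dev/sdc mkpart primary 50% 75%
>>   parted /dev/sdc mkpart primary 75% 100%
>>
>> Then mkfs and mount each partition and list all four mount points in
>> dfs.datanode.data.dir.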
>>
>> I’d also like to suggest, while restructuring your local filesystem, that
>> the tasktracker/nodemanager be given its own partition for writes.  If
>> both the tasktracker/nodemanager and datanode processes share a partition,
>> then when the mappers spill to disk the available HDFS space will shrink
>> and grow, because the datanode reports back to the namenode how much free
>> space it has on its partitions.
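>>
>> Concretely, something like this split keeps spill space out of what the
>> datanode reports (paths are made up; mapred.local.dir is the MR1
>> equivalent for a tasktracker):
>>
>>   <!-- yarn-site.xml: intermediate/spill data on its own partition -->
>>   <property>
>>     <name>yarn.nodemanager.local-dirs</name>
>>     <value>/scratch/yarn-local</value>
>>   </property>
>>
>>   <!-- hdfs-site.xml: HDFS block storage on separate partitions -->
>>   <property>
>>     <name>dfs.datanode.data.dir</name>
>>     <value>/data/disk1,/data/disk2,/data/disk3</value>
>>   </property>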
>>
>> Good luck.
>>
>> On Aug 6, 2014, at 1:51 PM, Felix Chern <idryman@gmail.com> wrote:
>>
>>> Run the “hadoop balancer” command on the namenode. It’s used for
>>> balancing skewed data.
>>> http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#balancer
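>>>
>>> For example (the threshold is how far, in percent, each datanode’s
>>> usage may deviate from the cluster average; 10 is the default):
>>>
>>>   hadoop balancer -threshold 10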
>>>
>>>
>>> On Aug 6, 2014, at 1:45 PM, Brian C. Huffman <
>>> bhuffman@etinternational.com> wrote:
>>>
>>>> All,
>>>>
>>>> We currently have a Hadoop 2.2.0 cluster with the following characteristics:
>>>> - 4 nodes
>>>> - Each node is a datanode
>>>> - Each node has 3 physical disks for data: 2 x 500GB and 1 x 2TB disk.
>>>> - HDFS replication factor of 3
>>>>
>>>> It appears that our 500GB disks are filling up first (the alternative
>>>> would be to put 4 times as many blocks on the 2TB disk in each node).
>>>> I'm concerned that once the 500GB disks fill, our performance will slow
>>>> down (fewer spindles being read / written at the same time per node).  Is
>>>> this correct?  Is there anything we can do to change this behavior?
>>>>
>>>> Thanks,
>>>> Brian
>>>>
>>>>
>>>>
>
>
