hadoop-common-user mailing list archives

From Rakesh Radhakrishnan <rake...@apache.org>
Subject Re: About Archival Storage
Date Wed, 20 Jul 2016 06:29:38 GMT
Data is moved from hot storage to cold storage based on the storage
policy. The storage policy defines how many replicas are placed on each
storage type. You can change the storage policy on a directory (for
example, from HOT to COLD) and then run the Mover tool on that directory
("hdfs mover -p <path>") to make the new policy effective. The storage
policy can be set or changed with the HDFS command "hdfs storagepolicies
-setStoragePolicy -path <path> -policy <policy>". After setting the new
policy, you need to run the Mover tool; it identifies the replicas to be
moved based on the storage policy information and schedules the movement
between source and destination data nodes to satisfy the policy.
Internally, the tool compares the storage type of each block replica
against the storage policy requirement.
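
To make the storage type comparison concrete, here is a minimal Python
sketch. It is a toy model, not the Mover source code: the names
POLICY_STORAGE_TYPES, expected_types and plan_moves are hypothetical, and
the placement table only covers the HOT, WARM and COLD policies as they
are described in the Archival Storage documentation.

```python
from collections import Counter

# Assumed placement rules (from the Archival Storage docs): a list of
# leading storage types, with the last entry reused for remaining replicas.
POLICY_STORAGE_TYPES = {
    "HOT": ["DISK"],              # all replicas on DISK
    "WARM": ["DISK", "ARCHIVE"],  # one replica on DISK, the rest on ARCHIVE
    "COLD": ["ARCHIVE"],          # all replicas on ARCHIVE
}


def expected_types(policy, replication):
    """Storage types the policy wants for a block with `replication` replicas."""
    types = POLICY_STORAGE_TYPES[policy]
    return [types[min(i, len(types) - 1)] for i in range(replication)]


def plan_moves(policy, current_types):
    """Return (source_type, target_type) pairs needed to satisfy the policy."""
    expected = Counter(expected_types(policy, len(current_types)))
    current = Counter(current_types)
    surplus = sorted((current - expected).elements())  # replicas on a wrong type
    missing = sorted((expected - current).elements())  # types still required
    return list(zip(surplus, missing))


if __name__ == "__main__":
    # A block written under HOT whose directory was switched to COLD.
    print(plan_moves("COLD", ["DISK", "DISK", "DISK"]))
    # The same block under WARM only needs two replicas moved.
    print(plan_moves("WARM", ["DISK", "DISK", "DISK"]))
```

An empty result means the block already satisfies the policy, which is the
case in which the tool has nothing to schedule for that block.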

You can refer to
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
to learn more about storage types, storage policies and HDFS commands. Hope
this helps.

Rakesh

On Wed, Jul 20, 2016 at 10:30 AM, kevin <kiss.kevin119@gmail.com> wrote:

> Thanks again. By "automatically" I mean: does the hdfs mover know when
> hot data has become cold, so that I don't need to tell it exactly which
> files/dirs need to be moved now?
> Of course, I should tell it which files/dirs it needs to monitor.
>
> 2016-07-20 12:35 GMT+08:00 Rakesh Radhakrishnan <rakeshr@apache.org>:
>
>> >>> I have another question: does the hdfs mover (A New Data Migration
>> Tool) know when to move data from hot to cold automatically?
>> When the tool runs, it reads its argument to get the list of HDFS
>> files/dirs to migrate. It then periodically scans these files in HDFS to
>> check whether the block placement satisfies the storage policy; if not,
>> it moves the replicas to a different storage type in order to fulfill
>> the storage policy requirement. This cycle continues until it hits an
>> error, there are no more blocks to move, etc. Could you please tell me
>> what you mean by "automatically"? FYI, HDFS-10285 proposes introducing a
>> daemon thread in the Namenode to track the storage movements requested
>> by clients through APIs. This daemon thread, named
>> StoragePolicySatisfier (SPS), serves a role similar to
>> ReplicationMonitor. If interested, you can read the proposal at
>> https://goo.gl/NA5EY0; feedback is welcome.
>>
>> The sleep time between cycles is ('dfs.heartbeat.interval' * 2000) +
>> ('dfs.namenode.replication.interval' * 1000) milliseconds.
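
As a worked example of the sleep time formula, here is a small Python
sketch. The function name mover_sleep_ms is hypothetical, and the default
of 3 seconds for both settings is an assumption based on the usual
hdfs-default.xml values; check your own configuration before relying on it.

```python
def mover_sleep_ms(heartbeat_interval_s=3, replication_interval_s=3):
    """Sleep time between Mover cycles, in milliseconds.

    Both arguments are in seconds. The defaults of 3 are assumed values
    for dfs.heartbeat.interval and dfs.namenode.replication.interval.
    """
    return heartbeat_interval_s * 2000 + replication_interval_s * 1000


if __name__ == "__main__":
    # With the assumed defaults: 3 * 2000 + 3 * 1000 = 9000 ms
    print(mover_sleep_ms())
    # With a 10 second heartbeat interval: 10 * 2000 + 3 * 1000 = 23000 ms
    print(mover_sleep_ms(heartbeat_interval_s=10))
```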
>>
>> >>> Does it use an algorithm like LRU or LFU?
>> It simply iterates over the list in the order of the files/dirs given
>> to this tool as an argument. AFAIK, it just maintains the order specified
>> by the user.
>>
>> Regards,
>> Rakesh
>>
>>
>> On Wed, Jul 20, 2016 at 7:05 AM, kevin <kiss.kevin119@gmail.com> wrote:
>>
>>> Thanks a lot Rakesh.
>>>
>>> I have another question: does the hdfs mover (A New Data Migration
>>> Tool) know when to move data from hot to cold automatically? Does it
>>> use an algorithm like LRU or LFU?
>>>
>>> 2016-07-19 19:55 GMT+08:00 Rakesh Radhakrishnan <rakeshr@apache.org>:
>>>
>>>> >>>>Is that mean I should config dfs.replication with 1 ?  if more
>>>> than one I should not use *Lazy_Persist*  policies ?
>>>>
>>>> The idea of the Lazy_Persist policy is that, while writing blocks, one
>>>> replica is placed in memory first and then lazily persisted to DISK.
>>>> It doesn't mean that you are not allowed to configure dfs.replication >
>>>> 1. If 'dfs.replication' is configured > 1, the first replica will be
>>>> placed in RAM_DISK and all the other replicas (n-1) will be written to
>>>> DISK. Here the (n-1) replicas incur the overhead of pipeline
>>>> replication over the network and the DISK write latency on the write hot
>>>> path, so you will not get better performance results.
>>>>
>>>> IIUC, to get the memory latency benefits, it is recommended to use
>>>> replication=1. This way, applications should be able to perform single
>>>> replica writes to a local DN with low latency. HDFS will store block data
>>>> in memory and lazily save it to disk, which avoids disk write latency
>>>> on the hot path. By writing to local memory we can also avoid checksum
>>>> computation on the hot path.
>>>>
>>>> Regards,
>>>> Rakesh
>>>>
>>>> On Tue, Jul 19, 2016 at 3:25 PM, kevin <kiss.kevin119@gmail.com> wrote:
>>>>
>>>>> I don't quite understand: "Note that the Lazy_Persist policy is useful
>>>>> only for single replica blocks. For blocks with more than one replicas,
>>>>> all the replicas will be written to DISK since writing only one of the
>>>>> replicas to RAM_DISK does not improve the overall performance."
>>>>>
>>>>> Does that mean I should configure dfs.replication as 1? If it is more
>>>>> than one, should I not use the *Lazy_Persist* policy?
>>>>>
>>>>
>>>>
>>>
>>
>
