hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From kevin <kiss.kevin...@gmail.com>
Subject Re: About Archival Storage
Date Wed, 20 Jul 2016 06:48:02 GMT
Ok,thanks.It means that I need to decide which data is hot and which time
it is cold, then change it storage policy and tell  'Mover tool'  to move

2016-07-20 14:29 GMT+08:00 Rakesh Radhakrishnan <rakeshr@apache.org>:

> Based on storage policy the data from hot storage will be moved to cold
> storage. The storage policy defines the number of replicas to be located on
> each storage type. It is possible to change the storage policy on a
> directory(for example: HOT to COLD) and then invoke 'Mover tool' on that
> directory to make the policy effective. One can set/change the storage
> policy via HDFSCommand, "hdfs storagepolicies -setStoragePolicy -path
> <path> -policy <policy>". After setting the new policy, you need to run the
> tool, then it identifies the replicas to be moved based on the storage
> policy information, and schedules the movement between source and
> destination data nodes to satisfy the policy. Internally, the tool is
> comparing the 'storage type' of a block in order to fulfill the 'storage
> policy' requirement.
> Probably you can refer
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html
> to know more about storage types, storage policies and hdfs commands. Hope
> this helps.
> Rakesh
> On Wed, Jul 20, 2016 at 10:30 AM, kevin <kiss.kevin119@gmail.com> wrote:
>> Thanks again. "automatically" what I mean is the hdfs mover knows the
>> hot data have come to cold , I don't need to tell it what exactly files/dirs
>> need to be move now ?
>> Of course I should tell it what files/dirs need to monitoring.
>> 2016-07-20 12:35 GMT+08:00 Rakesh Radhakrishnan <rakeshr@apache.org>:
>>> >>>I have another question is , hdfs mover (A New Data Migration Tool
>>> know when to move data from hot to cold  automatically ?
>>> While running the tool, it reads the argument and get the separated list
>>> of hdfs files/dirs to migrate. Then it periodically scans these files in
>>> HDFS to check if the block placement satisfies the storage policy, if not
>>> satisfied it moves the replicas to a different storage type in order to
>>> fulfill the storage policy requirement. This cycle continues until it hits
>>> an error or no blocks to move etc. Could you please tell me, what do you
>>> meant by "automatically" ? FYI, HDFS-10285 is proposing an idea to
>>> introduce a daemon thread in Namenode to track the storage movements set by
>>> APIs from clients. This Daemon thread named as StoragePolicySatisfier(SPS)
>>> serves something similar to ReplicationMonitor. If interested you can read
>>> the https://goo.gl/NA5EY0 proposal/idea and welcome feedback.
>>> Sleep time between each cycle is, ('dfs.heartbeat.interval' * 2000) +
>>> ('dfs.namenode.replication.interval' * 1000) milliseconds;
>>> >>>It use algorithm like LRU、LFU ?
>>> It will simply iterating over the lists in the order of files/dirs given
>>> to this tool as an argument. afaik, its just maintains the order mentioned
>>> by the user.
>>> Regards,
>>> Rakesh
>>> On Wed, Jul 20, 2016 at 7:05 AM, kevin <kiss.kevin119@gmail.com> wrote:
>>>> Thanks a lot Rakesh.
>>>> I have another question is , hdfs mover (A New Data Migration Tool )
>>>> know when to move data from hot to cold  automatically ? It
>>>> use algorithm like LRU、LFU ?
>>>> 2016-07-19 19:55 GMT+08:00 Rakesh Radhakrishnan <rakeshr@apache.org>:
>>>>> >>>>Is that mean I should config dfs.replication with 1 ?
 if more
>>>>> than one I should not use *Lazy_Persist*  policies ?
>>>>> The idea of Lazy_Persist policy is, while writing blocks, one replica
>>>>> will be placed in memory first and then it is lazily persisted into DISK.
>>>>> It doesn't means that, you are not allowed to configure dfs.replication
>>>>> 1. If 'dfs.replication' is configured > 1 then the first replica will
>>>>> placed in RAM_DISK and all the other replicas (n-1) will be written to
>>>>> DISK. Here the (n-1) replicas will have the overhead of pipeline
>>>>> replication over the network and the DISK write latency on the write
>>>>> path. So you will not get better performance results.
>>>>> IIUC, for getting memory latency benefits, it is recommended to use
>>>>> replication=1. In this way, applications should be able to perform single
>>>>> replica writes to a local DN with low latency. HDFS will store block
>>>>> in memory and lazily save it to disk avoiding incurring disk write latency
>>>>> on the hot path. By writing to local memory we can also avoid checksum
>>>>> computation on the hot path.
>>>>> Regards,
>>>>> Rakesh
>>>>> On Tue, Jul 19, 2016 at 3:25 PM, kevin <kiss.kevin119@gmail.com>
>>>>> wrote:
>>>>>> I don't quite understand :"Note that the Lazy_Persist policy is
>>>>>> useful only for single replica blocks. For blocks with more than
>>>>>> replicas, all the replicas will be written to DISK since writing
only one
>>>>>> of the replicas to RAM_DISK does not improve the overall performance."
>>>>>> Is that mean I should config dfs.replication with 1 ?  if more than
>>>>>> one I should not use *Lazy_Persist*  policies ?

View raw message