falcon-dev mailing list archives

From Suresh Srinivas <sur...@hortonworks.com>
Subject Re: lifecycle - retention
Date Fri, 22 Jan 2016 21:42:50 GMT
Sowmya, awesome and detailed! Thank you and you should encourage others to
do this too.

On 1/22/16, 12:20 PM, "Sowmya Ramesh" <sramesh@hortonworks.com> wrote:

>Hi John,
>
>Retention policy determines how long the data will remain on the cluster.
>
>Falcon kicks off the retention policy on the basis of the time value you
>specify in the retention limit:
>
>* Less than 24 hours: Falcon kicks off the retention policy job every 6
>hours
>* More than 24 hours: Falcon kicks off the retention policy job every 24
>hours
>
>When a feed is scheduled, Falcon kicks off the retention policy
>immediately. When the job runs, it deletes everything that's eligible for
>eviction. The eligibility criterion is the date pattern on the partition
>and NOT the creation date. For example, if the retention limit is 90 days,
>the retention job consistently deletes files older than 90 days.
>
>I don't understand what you mean by records inside the file. I am
>assuming you mean files within a directory.
>
>For retention, Falcon expects data to be in dated partitions. I will try
>to explain the retention policy logic with an example.
>Let's say your feed location is defined as below:
>
><locations>
>        <location type="data"
>path="/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
>        <location type="stats" path="/none"/>
>        <location type="meta" path="/none"/>
></locations>
>
>When the retention job is kicked off, it finds all the files that need to
>be evicted based on the retention policy. For the feed example above:
>* It gets the location from the feed, which is
>"/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"
>* Then it uses pattern matching to derive the file pattern used to list
>the files for the feed: "/falcon/demo/primary/clicks/*-*-*-*"
>* Calls FileSystem.globStatus with the file pattern
>"/falcon/demo/primary/clicks/*-*-*-*" to get the list of files
>* Gets the date from the file path. For example, if the file path is
>/falcon/demo/primary/clicks/2016-01-11-02, the mapped date is
>2016-01-11T02:00Z
>* If the file path date is beyond the retention limit it's deleted
>
>Because this relies on pattern matching against file paths, it is not
>time-consuming.
>You can set retention policies on a per-cluster basis, but not on a
>per-field basis.
>
>Hope this helps. Let us know if you have any further queries.
>
>Thanks!
>
>On 1/22/16, 9:55 AM, "John Smith" <lenovomi@gmail.com> wrote:
>
>>Hello,
>>
>>I found that Falcon supports retention policy as part of the Lifecycle. I
>>am wondering how it works, because it's not clear to me from reading the
>>documentation.
>>
>>Assume I store one file (with thousands/millions of records) into HDFS
>>and I set a retention period of 1 year.
>>
>>How is that retention period enforced on the records inside the file?
>>Does it mean that the scheduler executes some "flow" that reads the stored
>>file record by record every day and checks the current date against the
>>retention date? In case the current date >= retention date the record is
>>removed. Is it cpu/time consuming? Does each check require a full file
>>scan?
>>
>>What will happen in a scenario where I define different retention dates
>>per field?
>>
>>
>>
>>Thank you!
>>
>>Best,
>>John
>
>
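
As a rough illustration of the retention steps quoted above, here is a minimal Python sketch. It is not Falcon's own code (Falcon is written in Java). The function names, the in-memory list of paths standing in for the result of FileSystem.globStatus, and the choice to treat a limit of exactly 24 hours like the "more than 24 hours" case are all assumptions made for the example. It also assumes ${HOUR} is the hour of day, so a path ending in 2016-01-11-02 maps to 02:00 UTC on 2016-01-11.

```python
import fnmatch
import re
from datetime import datetime, timedelta, timezone

# Date variables that may appear in a feed location, with their digit widths.
VAR = re.compile(r"\$\{(YEAR|MONTH|DAY|HOUR)\}")
WIDTH = {"YEAR": 4, "MONTH": 2, "DAY": 2, "HOUR": 2}


def retention_frequency(limit):
    """How often the retention job runs for a given retention limit.

    Assumption: a limit of exactly 24 hours is treated as the 24-hour case.
    """
    if limit < timedelta(hours=24):
        return timedelta(hours=6)
    return timedelta(hours=24)


def to_glob(template):
    """Replace every ${VAR} with '*' to get the file pattern for listing."""
    return VAR.sub("*", template)


def path_date(template, path):
    """Extract the partition date encoded in a concrete path, or None."""
    parts, pos = [], 0
    for m in VAR.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))
        name = m.group(1)
        parts.append(r"(?P<%s>\d{%d})" % (name, WIDTH[name]))
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    match = re.fullmatch("".join(parts), path)
    if match is None or "YEAR" not in match.groupdict():
        return None
    f = match.groupdict()
    try:
        return datetime(int(f["YEAR"]), int(f.get("MONTH") or 1),
                        int(f.get("DAY") or 1), int(f.get("HOUR") or 0),
                        tzinfo=timezone.utc)
    except ValueError:  # e.g. month 13: not a real partition date
        return None


def evictable(template, paths, retention, now):
    """Return the paths whose partition date is older than the limit.

    `paths` stands in for a directory listing; fnmatch is only a rough
    stand-in for HDFS globbing (its '*' also matches '/').
    """
    pattern = to_glob(template)
    cutoff = now - retention
    result = []
    for p in paths:
        if not fnmatch.fnmatchcase(p, pattern):
            continue
        d = path_date(template, p)
        # Eligibility depends only on the date in the path, never on the
        # file's creation time or its contents.
        if d is not None and d < cutoff:
            result.append(p)
    return result
```

Calling evictable with the feed location from the example, a list of candidate paths, timedelta(days=90) and the current UTC time returns only the dated partitions that fall before the cutoff. No file is opened or scanned at any point, which is why the cost does not depend on how many records each file holds.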
