falcon-dev mailing list archives

From Sowmya Ramesh <sram...@hortonworks.com>
Subject Re: lifecycle - retention
Date Fri, 22 Jan 2016 20:20:02 GMT
Hi John,

Retention policy determines how long the data will remain on the cluster.

How often Falcon runs the retention job depends on the time value you
specify in the retention limit:

* Less than 24 hours: Falcon kicks off the retention policy job every 6
hours
* More than 24 hours: Falcon kicks off the retention policy job every 24
hours
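
To make the rule above concrete, here is a minimal Python sketch of it.
This is not Falcon's code; the function name is made up for illustration,
and treating a limit of exactly 24 hours as the 24-hour case is an
assumption, since the rule only covers "less than" and "more than".

```python
from datetime import timedelta


def retention_job_frequency(retention_limit: timedelta) -> timedelta:
    """Return how often the retention job runs for a given retention limit.

    Hypothetical helper that mirrors the rule described above.
    """
    # Assumption: a limit of exactly 24 hours is handled like the
    # "more than 24 hours" case.
    if retention_limit < timedelta(hours=24):
        return timedelta(hours=6)
    return timedelta(hours=24)
```

For example, a 12-hour limit gives a 6-hour frequency, and a 90-day limit
gives a 24-hour frequency.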

When a feed is scheduled, Falcon kicks off the retention policy
immediately. When the job runs, it deletes everything that is eligible for
eviction. The eligibility criterion is the date pattern on the partition,
NOT the creation date. For example, if the retention limit is 90 days, the
retention job consistently deletes files older than 90 days.

I don't understand what you mean by records inside the file. I am
assuming you mean files within a directory.

For retention, Falcon expects data to be in dated partitions. I will try
to explain the retention policy logic with an example.
Let's say your feed location is defined as below:

<locations>
        <location type="data"
path="/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/none"/>
        <location type="meta" path="/none"/>
</locations>

When the retention job is kicked off, it finds all the files that need to
be evicted based on the retention policy. For the feed example above:
* It gets the location from the feed, which is
"/falcon/demo/primary/clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}"
* It then replaces the date variables with wildcards to get the file
pattern for the feed: "/falcon/demo/primary/clicks/*-*-*-*"
* It calls FileSystem.globStatus with the file pattern
"/falcon/demo/primary/clicks/*-*-*-*" to get the list of files
* It gets the date from the file path. For example, if the file path is
/falcon/demo/primary/clicks/2016-01-11-02, the mapped date is
2016-01-11T02:00Z
* If the date from the file path is older than the retention limit, the
file is deleted
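
The steps above can be sketched roughly as follows. This is a simplified
Python illustration, not Falcon's actual implementation: all function
names are hypothetical, the list of paths stands in for what
FileSystem.globStatus would return from HDFS, and fnmatch is used as an
approximation of Hadoop glob matching (unlike Hadoop, its "*" also
matches "/").

```python
import fnmatch
import re
from datetime import datetime, timedelta, timezone

# Date variables as they appear in the feed location, with a regex for each.
VAR_PATTERNS = {
    "YEAR": r"(?P<YEAR>\d{4})",
    "MONTH": r"(?P<MONTH>\d{2})",
    "DAY": r"(?P<DAY>\d{2})",
    "HOUR": r"(?P<HOUR>\d{2})",
}
VAR_RE = re.compile(r"\$\{(YEAR|MONTH|DAY|HOUR)\}")


def to_glob(template):
    """Replace every date variable with '*' to get the file pattern."""
    return VAR_RE.sub("*", template)


def to_regex(template):
    """Build a regex that captures the date parts of a concrete path."""
    # Splitting on a pattern with one capture group yields literal text at
    # even indices and variable names at odd indices.
    parts = VAR_RE.split(template)
    pieces = [VAR_PATTERNS[p] if i % 2 else re.escape(p)
              for i, p in enumerate(parts)]
    return re.compile("^" + "".join(pieces) + "$")


def partition_date(path, regex):
    """Map a path to the UTC date encoded in it, or None if it has none."""
    match = regex.match(path)
    if match is None:
        return None
    g = match.groupdict()
    try:
        return datetime(int(g["YEAR"]), int(g.get("MONTH", 1)),
                        int(g.get("DAY", 1)), int(g.get("HOUR", 0)),
                        tzinfo=timezone.utc)
    except ValueError:  # e.g. month 13: not a real date, so skip the path
        return None


def evictable(paths, template, retention_limit, now):
    """Return the paths whose partition date is older than the limit."""
    glob, regex = to_glob(template), to_regex(template)
    cutoff = now - retention_limit
    result = []
    for path in paths:
        if not fnmatch.fnmatchcase(path, glob):  # stand-in for globStatus
            continue
        date = partition_date(path, regex)
        if date is not None and date < cutoff:
            result.append(path)
    return result
```

Note that only the path is inspected; the file contents and the file
creation time never enter the decision.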

Because this relies on pattern matching against file paths rather than
reading file contents, it is not time consuming.
Retention policies are set on a per-cluster basis, not on a per-field
basis.

Hope this helps. Let us know if you have any further queries.

Thanks!

On 1/22/16, 9:55 AM, "John Smith" <lenovomi@gmail.com> wrote:

>Hello,
>
>I found that Falcon supports a retention policy as part of the Lifecycle. I
>am wondering how it works, because it's not clear to me from reading the
>documentation.
>
>Assume I store one file (with thousands/millions of records) in HDFS and
>I set the retention period to 1 year.
>
>How is that retention period enforced on the records inside the file? Does
>it mean that the scheduler executes some "flow" that reads the stored file
>record by record every day and checks the current date against the
>retention date?
>If the current date >= retention date, is the record removed? Is it
>CPU/time consuming? Does each check require a full file scan?
>
>What will happen in a scenario where I define different retention dates per
>field?
>
>
>
>Thank you!
>
>Best,
>John

