hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Identifying new files on HDFS
Date Wed, 25 Mar 2015 21:24:25 GMT
Have you looked at the files' timestamps? HDFS maintains both an mtime and
an atime for each file (the latter is not exposed in -ls, though).
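
As a rough illustration of the timestamp approach, the sketch below parses the text output of `hdfs dfs -ls` and keeps only the files whose modification time is later than a cutoff. The function names (`parse_ls_line`, `files_newer_than`, `list_new_files`) are hypothetical, and the sketch assumes the usual eight-column listing format with minute-resolution timestamps, so files modified within the same minute as the cutoff are not selected.

```python
import subprocess
from datetime import datetime


def parse_ls_line(line):
    """Parse one line of `hdfs dfs -ls` output into (path, mtime), or None."""
    parts = line.strip().split(None, 7)
    # Skip the "Found N items" header, blank lines, and directories.
    if len(parts) < 8 or parts[0].startswith("d"):
        return None
    mtime = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
    return parts[7], mtime


def files_newer_than(ls_output, cutoff):
    """Return paths from an `hdfs dfs -ls` listing modified after `cutoff`."""
    new_files = []
    for line in ls_output.splitlines():
        parsed = parse_ls_line(line)
        if parsed is not None and parsed[1] > cutoff:
            new_files.append(parsed[0])
    return new_files


def list_new_files(hdfs_dir, cutoff):
    """Run `hdfs dfs -ls` on a directory (assumes the hdfs CLI is on PATH)."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", hdfs_dir],
        capture_output=True, text=True, check=True,
    )
    return files_newer_than(result.stdout, cutoff)
```

A daily job could store the time of its last successful run and pass it as `cutoff` on the next run.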

In an ETL context, a simple workflow system also solves this. You keep an
incoming directory, a done directory, a destination directory, etc., and
each job moves files between them before and after processing. That lets
you track new content and avoid repeated processing (as one simple example).
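
The directory-based workflow can be sketched as below. This is only an illustration under stated assumptions: local directories stand in for HDFS paths, and `process_incoming` and its `transform` argument are hypothetical names. On a real cluster the move step would be an HDFS rename (for example via `hdfs dfs -mv`) rather than `shutil.move`.

```python
import shutil
from pathlib import Path


def process_incoming(incoming, done, destination, transform):
    """Process every file in `incoming` once, then move it to `done`.

    `transform` is a function from input text to output text; its result
    is written to `destination` under the same file name.
    """
    incoming, done, destination = Path(incoming), Path(done), Path(destination)
    done.mkdir(parents=True, exist_ok=True)
    destination.mkdir(parents=True, exist_ok=True)

    processed = []
    for src in sorted(p for p in incoming.iterdir() if p.is_file()):
        # Write the processed result first ...
        (destination / src.name).write_text(transform(src.read_text()))
        # ... then move the input out of `incoming` so the next run skips it.
        shutil.move(str(src), str(done / src.name))
        processed.append(src.name)
    return processed
```

Because processed inputs leave the incoming directory, whatever remains there at the start of a run is new by construction, and no timestamp bookkeeping is needed.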

On Wed, Mar 25, 2015 at 11:11 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

> Hi,
> Have you considered taking a snapshot of the files at close of business,
> comparing it with the new snapshot, and processing only the new ones? Just
> a simple shell script will do.
> ------------------------------
> *From: * Vijaya Narayana Reddy Bhoomi Reddy <
> vijaya.bhoomireddy@whishworks.com>
> *Date: *Wed, 25 Mar 2015 09:55:57 +0000
> *To: *<user@hadoop.apache.org>
> *ReplyTo: * user@hadoop.apache.org
> *Subject: *Identifying new files on HDFS
> Hi,
> We have a requirement to process only new files in HDFS on a daily basis.
> I am sure this is a common requirement in many ETL processing scenarios.
> Is there a way to identify new files that have been added to a path in
> HDFS? For example, assume some files have already been present for some
> time, and I have added new files today. I want to process only those new
> files. What is the best way to achieve this?
> Thanks & Regards
> Vijay
> *Vijay Bhoomireddy*, Big Data Architect

Harsh J
