hadoop-common-user mailing list archives

From "Mich Talebzadeh" <m...@peridale.co.uk>
Subject RE: Identifying new files on HDFS
Date Wed, 25 Mar 2015 21:54:42 GMT
Good points. I will set up "done", "empty" and "failed" directories.




Mich Talebzadeh








From: Harsh J [mailto:harsh@cloudera.com] 
Sent: 25 March 2015 21:24
To: user@hadoop.apache.org; mich@peridale.co.uk
Subject: Re: Identifying new files on HDFS


Look at timestamps of the file? HDFS maintains both mtimes and atimes (latter's not exposed
in -ls though).
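
A minimal sketch of the timestamp approach, assuming the listing comes from `hdfs dfs -ls` in its usual eight-column form (permissions, replication, owner, group, size, date, time, path). The function name, the sample paths and the cutoff are hypothetical, and the mtime shown by -ls only has minute resolution:

```python
from datetime import datetime


def files_modified_since(ls_lines, cutoff):
    """Return file paths from `hdfs dfs -ls` output with mtime >= cutoff."""
    recent = []
    for line in ls_lines:
        parts = line.rstrip("\r\n").split(None, 7)
        # Entries have 8 fields; this skips "Found N items" and blank lines.
        if len(parts) != 8:
            continue
        perms, _repl, _owner, _group, _size, date, time, path = parts
        if perms.startswith("d"):
            continue  # directories are not input files
        mtime = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M")
        if mtime >= cutoff:
            recent.append(path)
    return recent


if __name__ == "__main__":
    # Hypothetical listing, e.g. captured from: hdfs dfs -ls /data/incoming
    sample = [
        "Found 2 items",
        "-rw-r--r--   3 etl hadoop  1024 2015-03-24 18:30 /data/incoming/old.csv",
        "-rw-r--r--   3 etl hadoop  2048 2015-03-25 09:55 /data/incoming/new.csv",
    ]
    print(files_modified_since(sample, datetime(2015, 3, 25)))
```

The cutoff would typically be the start time of the previous successful run, stored somewhere outside the job itself.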


In an ETL context, a simple workflow system also resolves this. You have an incoming directory,
a done directory, a destination directory, etc., and each job moves files between them before
and after processing, so that new content is tracked and nothing is processed twice (as one simple example).
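
The directory workflow can be sketched roughly as below. This is only an illustration under assumptions: it runs against local directories with `shutil.move` as a stand-in for an HDFS rename (for example `hdfs dfs -mv`), and the function name, the `handler` callback and the extra "failed" directory are hypothetical.

```python
import shutil
from pathlib import Path


def process_incoming(incoming, done, failed, handler):
    """Run handler on each file in incoming, then move it to done or failed."""
    incoming, done, failed = Path(incoming), Path(done), Path(failed)
    done.mkdir(parents=True, exist_ok=True)
    failed.mkdir(parents=True, exist_ok=True)
    results = {"done": [], "failed": []}
    for path in sorted(incoming.iterdir()):
        if not path.is_file():
            continue
        try:
            handler(path)  # the actual ETL step, supplied by the caller
            target, key = done, "done"
        except Exception:
            target, key = failed, "failed"
        # Moving the file out of incoming is what prevents reprocessing.
        shutil.move(str(path), str(target / path.name))
        results[key].append(path.name)
    return results
```

Because every file leaves the incoming directory once it has been handled, the next run only ever sees files that arrived since the last one.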


On Wed, Mar 25, 2015 at 11:11 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:


Have you considered taking a snapshot of the file listing at close of business, comparing it
with the new snapshot, and processing only the new files? Just a simple shell script will do.
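
A rough sketch of the comparison step, assuming "snapshot" here means a saved list of file paths (for instance from a recursive listing) rather than the HDFS snapshot feature. The function name and sample paths are hypothetical:

```python
def new_since_snapshot(previous, current):
    """Return paths in the current listing that were absent from the previous one."""
    return sorted(set(current) - set(previous))


if __name__ == "__main__":
    # Hypothetical listings taken at two consecutive closes of business.
    yesterday = ["/data/in/a.csv", "/data/in/b.csv"]
    today = ["/data/in/a.csv", "/data/in/b.csv", "/data/in/c.csv"]
    print(new_since_snapshot(yesterday, today))
```

Note that a plain path comparison will not notice a file that was overwritten under the same name; combining it with modification times would cover that case.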




From: Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomireddy@whishworks.com>
Date: Wed, 25 Mar 2015 09:55:57 +0000
To: <user@hadoop.apache.org>
ReplyTo: user@hadoop.apache.org
Subject: Identifying new files on HDFS




We have a requirement to process only new files in HDFS on a daily basis. I am sure this is
a general requirement in many ETL-style processing scenarios. Is there a way to identify new
files that have been added to a path in HDFS? For example, assume some files have already been
present for some time, and I have added new files today. I want to process only those new
files. What is the best way to achieve this?


Thanks & Regards



Vijay Bhoomireddy, Big Data Architect



Harsh J
