From: "Mich Talebzadeh" <mich@peridale.co.uk>
To: user@hadoop.apache.org
Subject: RE: Identifying new files on HDFS
Date: Wed, 25 Mar 2015 21:54:42 -0000
Good points. I will have "done", "empty" and "failed" directories.

HTH

Mich Talebzadeh

http://talebzadehmich.wordpress.com

Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

From: Harsh J [mailto:harsh@cloudera.com]
Sent: 25 March 2015 21:24
To: user@hadoop.apache.org; mich@peridale.co.uk
Subject: Re: Identifying new files on HDFS

Look at the timestamps of the file?
HDFS maintains both mtimes and atimes (the latter is not exposed in -ls, though).

In an ETL context, a simple workflow system also resolves this. You have an incoming directory, a done directory, a destination directory, etc., and you can move files around pre/post processing for every job, to manage new content and avoid repeated processing (as one simple example).

On Wed, Mar 25, 2015 at 11:11 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

Hi,

Have you considered taking a snapshot of the files at close of business, comparing it with the new snapshot, and processing only the new ones? Just a simple shell script will do.

HTH

Let your email find you with BlackBerry from Vodafone

From: Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomireddy@whishworks.com>
Date: Wed, 25 Mar 2015 09:55:57 +0000
ReplyTo: user@hadoop.apache.org
Subject: Identifying new files on HDFS

Hi,

We have a requirement to process only new files in HDFS on a daily basis. I am sure this is a general requirement in many ETL kinds of processing scenarios. Just wondering if there is a way to identify new files that are added to a path in HDFS? For example, assume some files have already been present for some time, and I have added new files today, so I want to process only those new files. What is the best way to achieve this?

Thanks & Regards
Vijay

Vijay Bhoomireddy, Big Data Architect
1000 Great West Road, Brentford, London, TW8 9DW
T: +44 20 3475 7980
M: +44 7481 298 360
W: www.whishworks.com

The contents of this e-mail are confidential and for the exclusive use of the intended recipient. If you receive this e-mail in error, please delete it from your system immediately and notify us either by e-mail or telephone. You should not copy, forward or otherwise disclose the content of the e-mail.
The views expressed in this communication may not necessarily be the view held by WHISHWORKS.

--
Harsh J
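Mich's snapshot-and-compare suggestion can be sketched in a few lines. The helper below is a hypothetical illustration, not code from the thread: it treats each snapshot as a plain collection of path strings, whereas in practice the listings would be captured from `hdfs dfs -ls` output saved at close of business each day.

```python
# Sketch of the snapshot-and-compare approach: diff yesterday's
# listing against today's and keep only the paths that are new.
# The listings are plain lists of path strings here; on a real
# cluster they would come from saved `hdfs dfs -ls` output.

def new_files(previous_listing, current_listing):
    """Return paths present in the current snapshot but not in the previous one."""
    return sorted(set(current_listing) - set(previous_listing))

yesterday = ["/data/in/a.csv", "/data/in/b.csv"]
today = ["/data/in/a.csv", "/data/in/b.csv", "/data/in/c.csv"]
print(new_files(yesterday, today))  # ['/data/in/c.csv']
```

Note that a pure name diff misses files that were overwritten in place with new content; combining it with a modification-time check covers that case.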
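Harsh's timestamp suggestion can likewise be sketched. The function below uses local-filesystem mtimes as a stand-in (an assumption for illustration); against HDFS the same cutoff logic would be driven by the modification time from FileStatus.getModificationTime() in the Java API, or by parsing the date column of `hdfs dfs -ls`.

```python
import os

def files_newer_than(directory, cutoff_epoch):
    """Return files in `directory` modified after `cutoff_epoch` (Unix seconds).

    Local-filesystem stand-in for the HDFS mtime check: on HDFS the
    timestamp would come from FileStatus.getModificationTime() rather
    than os.path.getmtime().
    """
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) > cutoff_epoch:
            result.append(path)
    return result
```

Each daily run records the largest mtime it processed and passes it as the next run's cutoff, so files are picked up exactly once even if a run is delayed.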