From: "Mich Talebzadeh" <mich@peridale.co.uk>
To: user@hadoop.apache.org
Subject: RE: Identifying new files on HDFS
Date: Wed, 25 Mar 2015 21:54:42 -0000
Good points. I will have "done", "empty" and "failed" directories.

HTH

Mich Talebzadeh

http://talebzadehmich.wordpress.com

Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.

From: Harsh J [mailto:harsh@cloudera.com]
Sent: 25 March 2015 21:24
To: user@hadoop.apache.org; mich@peridale.co.uk
Subject: Re: Identifying new files on HDFS

Look at the timestamps of the file?
HDFS maintains both mtimes and atimes (the latter is not exposed in -ls, though).

In an ETL context, a simple workflow system also resolves this. You have an incoming directory, a done directory, a destination directory, etc., and you can move files around pre/post processing for every job, to manage new content and avoid repeated processing (as one simple example).

On Wed, Mar 25, 2015 at 11:11 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:

Hi,

Have you considered taking a snapshot of the files at close of business, comparing it with the new snapshot, and processing only the new ones? Just a simple shell script will do.

HTH

Let your email find you with BlackBerry from Vodafone

From: Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomireddy@whishworks.com>
Date: Wed, 25 Mar 2015 09:55:57 +0000
ReplyTo: user@hadoop.apache.org
Subject: Identifying new files on HDFS

Hi,

We have a requirement to process only new files in HDFS on a daily basis. I am sure this is a general requirement in many ETL kinds of processing scenarios. Just wondering if there is a way to identify new files that are added to a path in HDFS? For example, assume some files have already been present for some time, and I have added new files today, so I want to process only those new files. What is the best way to achieve this?

Thanks & Regards
Vijay

Vijay Bhoomireddy, Big Data Architect
1000 Great West Road, Brentford, London, TW8 9DW
T: +44 20 3475 7980
M: +44 7481 298 360
W: www.whishworks.com

The contents of this e-mail are confidential and for the exclusive use of the intended recipient. If you receive this e-mail in error, please delete it from your system immediately and notify us either by e-mail or telephone. You should not copy, forward or otherwise disclose the content of the e-mail.
The views expressed in this communication may not necessarily be the view held by WHISHWORKS.

--
Harsh J
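Mich's snapshot-and-compare suggestion can be sketched in a few lines. The helper below is a hypothetical illustration, not code from the thread: it treats each snapshot as a plain collection of path strings, whereas in practice the listings would be captured from `hdfs dfs -ls` output saved at close of business each day.

```python
# Sketch of the snapshot-and-compare approach: diff yesterday's
# listing against today's and keep only the paths that are new.
# The listings are plain lists of path strings here; on a real
# cluster they would come from saved `hdfs dfs -ls` output.

def new_files(previous_listing, current_listing):
    """Return paths present in the current snapshot but not in the previous one."""
    return sorted(set(current_listing) - set(previous_listing))

yesterday = ["/data/in/a.csv", "/data/in/b.csv"]
today = ["/data/in/a.csv", "/data/in/b.csv", "/data/in/c.csv"]
print(new_files(yesterday, today))  # ['/data/in/c.csv']
```

Note that a pure name diff misses files that were overwritten in place with new content; combining it with a modification-time check covers that case.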
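Harsh's timestamp suggestion can likewise be sketched. The function below uses local-filesystem mtimes as a stand-in (an assumption for illustration); against HDFS the same cutoff logic would be driven by the modification time from FileStatus.getModificationTime() in the Java API, or by parsing the date column of `hdfs dfs -ls`.

```python
import os

def files_newer_than(directory, cutoff_epoch):
    """Return files in `directory` modified after `cutoff_epoch` (Unix seconds).

    Local-filesystem stand-in for the HDFS mtime check: on HDFS the
    timestamp would come from FileStatus.getModificationTime() rather
    than os.path.getmtime().
    """
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) > cutoff_epoch:
            result.append(path)
    return result
```

Each daily run records the largest mtime it processed and passes it as the next run's cutoff, so files are picked up exactly once even if a run is delayed.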