Mailing-List: contact users-help@nifi.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@nifi.apache.org
MIME-Version: 1.0
Date: Wed, 4 Nov 2015 01:04:16 -0500
Message-ID: 
 <CAG_xVVSHhTZNntE6zOB+RYPhAneaQ=qnnGyX+s9UxL=ZH9YYYA@mail.gmail.com>
Subject: Suggestion on how to parse field out of filename
From: Mark Petronic <markpetronic@gmail.com>
To: users@nifi.apache.org
Content-Type: multipart/alternative; boundary=001a113f8d2ec3d4a70523b0c7dd

--001a113f8d2ec3d4a70523b0c7dd
Content-Type: text/plain; charset=UTF-8

I am looking for help on the best way to extract a field from a filename. I
need to parse the date out of the core filename attribute set by the
UnpackContent processor. I am unzipping files that contain many CSV files,
and the CSV file names vary in format, but each includes a timestamp.
Example formats are:

Priority_002_20151104123456_00.csv  (20151104123456 is yyyyMMddHHmmss)
ABC_02_1447586912344.csv (1447586912344 is Unix time in ms)
XYZ_20151104_1234.csv (20151104_1234 is yyyyMMdd_HHmm)

So, there are various forms to deal with. I need to normalize these into
yyyyMMddHHmmss. A regex with capture groups would be perfect, but I cannot
quite figure out how to do it. ExtractText does regex with capture groups,
but only against flowfile content, and these are attributes.
UpdateAttribute only supports the expression language, and as far as I can
tell that does not have regex-based extraction of capture groups.

In Python, I would just do something like:

import re

date, time = re.search(r"XYZ_(\d+)_(\d+)\.csv", "XYZ_20151104_1234.csv").groups()

Then I could use the expression language format or toDate functions to
normalize the dates.
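
To make the goal concrete, here is a rough Python sketch of the
normalization I am after for all three example formats. The patterns, the
PATTERNS table, and the normalize_timestamp name are only illustrative
assumptions. It also assumes the Unix time should be read as UTC and that
the yyyyMMdd_HHmm form gets "00" appended for the missing seconds.

```python
import re
from datetime import datetime, timezone

# Hypothetical patterns, one per example filename format in this message.
PATTERNS = [
    # Priority_002_20151104123456_00.csv (already yyyyMMddHHmmss)
    (re.compile(r"^[A-Za-z]+_\d+_(\d{14})_\d+\.csv$"), "full"),
    # ABC_02_1447586912344.csv (Unix time in ms)
    (re.compile(r"^[A-Za-z]+_\d+_(\d{13})\.csv$"), "epoch_ms"),
    # XYZ_20151104_1234.csv (yyyyMMdd_HHmm)
    (re.compile(r"^[A-Za-z]+_(\d{8})_(\d{4})\.csv$"), "date_hhmm"),
]


def normalize_timestamp(filename):
    """Return the timestamp embedded in filename as yyyyMMddHHmmss."""
    for pattern, kind in PATTERNS:
        match = pattern.match(filename)
        if match is None:
            continue
        if kind == "full":
            return match.group(1)
        if kind == "epoch_ms":
            # Assumption: interpret the epoch value as UTC, drop the ms.
            seconds = int(match.group(1)) // 1000
            stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
            return stamp.strftime("%Y%m%d%H%M%S")
        # date_hhmm: assumption: pad the missing seconds with "00".
        date, hhmm = match.groups()
        return date + hhmm + "00"
    raise ValueError("unrecognized filename format: " + filename)


print(normalize_timestamp("XYZ_20151104_1234.csv"))  # 20151104123400
```

Ideally the same thing would happen inside the flow, with the result
written to a new attribute on each unpacked flowfile.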

I know I could use a utility script with ExecuteStreamCommand, calling it
with the file path and getting back the tokens, but I am looking for an
internal way to do it without forking a process, as there are a lot of
archives in each zip and forking would add latency under heavy load.

Any thoughts?

Thanks!


--001a113f8d2ec3d4a70523b0c7dd--