Date: Wed, 4 Nov 2015 01:04:16 -0500
Subject: Suggestion on how to parse field out of filename
From: Mark Petronic
To: users@nifi.apache.org

Looking for some help on the best way to extract a field from a filename. I need to parse the date out of the core filename attribute set by the UnpackContent processor. I am unzipping files that contain many CSV files, and the CSV file names vary in format, but each includes a timestamp. Example formats are:

Priority_002_20151104123456_00.csv (20151104123456 is yyyyMMddHHmmss)
ABC_02_1447586912344.csv (1447586912344 is Unix time in ms)
XYZ_20151104_1234.csv (20151104_1234 is yyyyMMdd_HHmm)

So, there are various forms to deal with, and I need to normalize them all into yyyyMMddHHmmss. A regex with capture groups would be perfect, but I cannot quite figure out how to do it. ExtractText does regex with capture groups, but only against flowfile contents, and these are attributes. UpdateAttribute only supports expression language, and that does not have regex-based extraction of capture groups.

In Python, I would just do something like:

    import re
    date, time = re.search(r"XYZ_(\d+)_(\d+)\.csv", "XYZ_20151104_1234.csv").groups()

Then I could use the expression language format or toDate functions to normalize the dates.

I know I could use a utility script with ExecuteStreamCommand, calling it with the filepath and getting back the tokens, but I was looking for an internal way to do it without forking out, as there are a lot of archives in each zip and that would add latency under heavy load.

Any thoughts?

Thanks!
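
For reference, the normalization described in the message can be sketched in plain Python, extending the one-line regex idea to all three example filenames. This is only a rough illustration of the parsing logic, not a NiFi processor or a recommended way to wire it into a flow. The function name normalize_timestamp and the pattern table are hypothetical. Two assumptions go beyond the message: the Unix millisecond value is rendered in UTC, and the yyyyMMdd_HHmm form is padded with "00" for the missing seconds.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern table: each regex isolates the timestamp token for one
# of the three filename shapes quoted in the message. Order matters, because
# the first pattern that matches wins.
_PATTERNS = [
    # Priority_002_20151104123456_00.csv -> already yyyyMMddHHmmss
    (re.compile(r"_(\d{14})_\d+\.csv$"), "compact"),
    # ABC_02_1447586912344.csv -> Unix time in milliseconds
    (re.compile(r"_(\d{13})\.csv$"), "epoch_ms"),
    # XYZ_20151104_1234.csv -> yyyyMMdd_HHmm (no seconds)
    (re.compile(r"_(\d{8})_(\d{4})\.csv$"), "date_hhmm"),
]


def normalize_timestamp(filename):
    """Return the filename's timestamp as yyyyMMddHHmmss, or None if no pattern matches."""
    for pattern, kind in _PATTERNS:
        match = pattern.search(filename)
        if match is None:
            continue
        if kind == "compact":
            return match.group(1)
        if kind == "epoch_ms":
            # Assumption: render the epoch value in UTC.
            seconds = int(match.group(1)) // 1000
            moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
            return moment.strftime("%Y%m%d%H%M%S")
        if kind == "date_hhmm":
            # Assumption: pad the missing seconds with "00".
            return match.group(1) + match.group(2) + "00"
    return None


for name in (
    "Priority_002_20151104123456_00.csv",
    "ABC_02_1447586912344.csv",
    "XYZ_20151104_1234.csv",
):
    print(name, "->", normalize_timestamp(name))
```

Keeping the patterns in an ordered table means a new filename shape only needs one more entry, and a filename that matches nothing returns None so it can be routed separately rather than silently mis-parsed.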