nifi-issues mailing list archives

From "Chris Sampson (Jira)" <j...@apache.org>
Subject [jira] [Created] (NIFI-7145) Chained SplitText processors unable to handle files in some circumstances
Date Thu, 13 Feb 2020 18:36:00 GMT
Chris Sampson created NIFI-7145:
-----------------------------------

             Summary: Chained SplitText processors unable to handle files in some circumstances
                 Key: NIFI-7145
                 URL: https://issues.apache.org/jira/browse/NIFI-7145
             Project: Apache NiFi
          Issue Type: Bug
    Affects Versions: 1.11.1
         Environment: Docker Image (apache/nifi) running in Kubernetes (1.15)
            Reporter: Chris Sampson
         Attachments: Broken_SplitText.json, Broken_SplitText.xml, test.csv.tgz

With chained SplitText processors (NiFi 1.11.1 apache/nifi Docker image with default nifi.properties, although configured in my environment to allow secure access with encrypted FlowFile/provenance/content repositories; I don't know whether that makes a difference), the flow is as follows (a sketch for generating a comparable test file follows the list):
 * ingest a 40MB CSV file with 50k lines of data (plus 1 header line)
 * SplitText - chunk the file into 10k-line segments (including the header in each segment)
 * SplitText - break each row out into its own FlowFile
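For anyone reproducing without the attached test.csv.tgz, something like the following generates a comparable file. The column names and row padding are assumptions chosen only to land near 40MB with 50k data lines; they are not taken from the real attachment.

{code:python}
# Generate a ~40MB CSV with 1 header line plus 50k data rows
# (column names and contents are illustrative, not from the real file).
import csv

ROWS = 50_000
with open("/tmp/test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "payload"])  # single header line
    for i in range(ROWS):
        # Pad each row to roughly 800 bytes so 50k rows is ~40MB.
        writer.writerow([i, f"row-{i}", "x" * 780])
{code}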

 
The 10k chunking works fine, but the files then sit in the queue between the processors forever; the second SplitText shows that it's working but never actually produces anything (I can't see anything in the logs, although I haven't turned on debug logging to see whether that would provide anything more).
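The stuck state can at least be confirmed from outside the UI by asking the REST API for the queued count on the connection feeding the second SplitText. This is a rough sketch; the host, connection UUID, and the exact JSON field names are assumptions from my reading of the NiFi REST API docs (and a secured instance like mine would also need credentials/TLS handling):

{code:python}
# Print the queued FlowFile count for a connection (id is a placeholder).
import json
import urllib.request

NIFI = "http://localhost:8080/nifi-api"
CONN_ID = "REPLACE-WITH-CONNECTION-UUID"

with urllib.request.urlopen(f"{NIFI}/connections/{CONN_ID}") as resp:
    entity = json.load(resp)

# Field names assumed from ConnectionEntity/ConnectionStatusDTO.
snapshot = entity["status"]["aggregateSnapshot"]
print("queued:", snapshot["queued"])  # e.g. "50,000 / 40 MB"
{code}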
 
If I reduce the chunk size to 1k then the per-row split works fine - maybe this is some sort of issue with SplitText and/or the swapping of FlowFiles/content to the repositories?
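If swapping is the trigger, the relevant setting is the per-connection swap threshold in nifi.properties: 5 chunks x 10k rows means the downstream queue can reach 50k FlowFiles, well past the stock default shown below (this is the shipped default for 1.11.x, not a value I have tuned):

{code}
# FlowFiles beyond this count on a single connection are swapped to disk.
nifi.queue.swap.threshold=20000
{code}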
 
An example flow/template is attached, along with a file that breaks the flow (untar it and copy it into /tmp). The second SplitText is set to Concurrency=3 in the template, but it fails just the same when set to the default Concurrency=1.
 
SplitRecord would be an alternative (and works fine when I try it), but I can't use it because we could lose data if the CSV is malformed: where a row has more data fields than there are defined headers, the extra fields are thrown away by the Record processors. I understand that to be normal behaviour and it's fine in itself, but unfortunately I later need to run ValidateRecord over each of these rows to check for exactly this kind of invalidity.
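To illustrate the malformation in question outside NiFi (the data below is made up), a row carrying more fields than the header defines can be spotted like this; this surplus is exactly the signal that disappears once Record processors drop the extra fields:

{code:python}
# Flag CSV rows with more fields than the header defines.
# csv.DictReader collects surplus fields under the restkey.
import csv
import io

data = "id,name\n1,alice\n2,bob,UNEXPECTED\n"  # second data row is malformed

reader = csv.DictReader(io.StringIO(data), restkey="_extra")
for line_no, row in enumerate(reader, start=2):
    if "_extra" in row:
        print(f"line {line_no}: extra fields {row['_extra']}")
# -> line 3: extra fields ['UNEXPECTED']
{code}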



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
