nifi-dev mailing list archives

From Andy LoPresto <>
Subject Re: ListSFTP incoming relationship
Date Thu, 29 Mar 2018 16:43:40 GMT

I think there are two conversations going on here. You are finding the requirements for your
specific use case, and that’s great. But I echo Bryan’s point that a community processor
for this scenario should not store state at all. Sivaprasanna’s point that, given dynamic
directory input, storing state based on it can cause massive data ingestion problems still
stands.
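
To make the state concern concrete, here is a rough sketch in plain Python. It is a simplified model, not NiFi's actual state handling: the function name `list_with_state`, the `state` dictionary, and the directory naming are all assumptions made for illustration. It keeps one "newest timestamp seen" entry per directory, which is the kind of per-directory state being discussed.

```python
def list_with_state(state, directory, files):
    """Timestamp-based listing: return files newer than the stored mark.

    `state` maps directory -> newest modification time seen so far (a
    hypothetical stand-in for processor state). `files` is a list of
    (name, mtime) pairs standing in for a remote listing.
    """
    last_seen = state.get(directory, float("-inf"))
    new_files = [name for name, mtime in files if mtime > last_seen]
    if files:
        state[directory] = max(last_seen, max(mtime for _, mtime in files))
    return new_files


state = {}
# Each incoming flowfile names a different directory, so every call adds a key.
for day in range(1, 366):
    list_with_state(state, f"/data/2018/{day:03d}", [("report.csv", float(day))])
```

With a directory supplied per flowfile, the number of state entries grows with the number of distinct directories and is never trimmed. The same model also shows the out-of-order problem raised further down the thread: a file whose timestamp is older than the stored mark is never returned.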

For your specific use case, you can prototype (or possibly even get to a stable and robust-enough
point) using ExecuteScript to model the behavior you need.
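
As a starting point for such a prototype, the core behavior can be modeled outside NiFi first. The sketch below is plain Python against a local directory, not the ExecuteScript API and not SFTP; the attribute keys `scan.directory` and `scan.file.filter` are hypothetical names, and the dictionary argument stands in for the incoming flowfile's attributes.

```python
import os
import re


def scan_directory(attributes):
    """Stateless, one-time listing driven by incoming attributes.

    `attributes` stands in for the incoming flowfile's attributes; the keys
    "scan.directory" and "scan.file.filter" are hypothetical names.
    """
    directory = attributes["scan.directory"]
    pattern = re.compile(attributes.get("scan.file.filter", ".*"))
    listing = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Keep regular files whose whole name matches the filter.
            if entry.is_file() and pattern.fullmatch(entry.name):
                info = entry.stat()
                listing.append({
                    "filename": entry.name,
                    "path": directory,
                    "size": info.st_size,
                    "last_modified": info.st_mtime,
                })
    # Sort for deterministic output; nothing is remembered between calls.
    return sorted(listing, key=lambda item: item["filename"])
```

Because the function keeps nothing between calls, pointing it at a different directory on every invocation has no cumulative cost, which is the property the stateless processor is meant to have.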

With regard to the desired output format, I would suggest a few items:

* Avro requires a schema to be defined, and this raises the barrier to use of the processor.
Also, unless being sent to a processor that understands Avro, the result will need to be converted
anyway using Record* processors.
* If the output is individual flowfiles on a 1:1 basis, each should have as many attributes
populated with the parsed information as possible (e.g., file.path, file.size, file.owner,
file.permissions). This allows for easily consumable and routable flowfiles.
* If the output is a full directory listing, I would suggest `ls -al` type raw text output,
or JSON (arbitrary human-readable and machine-readable format with many consuming/transforming

Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Mar 29, 2018, at 9:34 AM, scott <> wrote:
> Sorry Bryan, but I disagree with you. Not storing state is NOT the main point of this
> new processor. The main point is to allow an incoming relationship flowfile to trigger the
> action, and allow variables to be used from the attributes therein.
> I agree that if the NiFi community deems it too risky to distribute this processor with
> state keeping optionally available, even if the default is to disable it, then so be it. If
> state is not included optionally, then how about making the output flowfile content include
> more than just the file names? Have it include last updated time along with the filename.
> If it searches recursively, you'll want to include the path to the file also. Maybe it would
> be best to output the results into a structured format, such as Avro? Or, maybe it would just
> be best to output one flowfile per remote file found, and include updated time and fully qualified
> path as attributes?
> Scott
> On 03/29/2018 04:32 AM, Bryan Bende wrote:
>> The main point of the new processor is to NOT store state so that it
>> becomes more reasonable to allow incoming flow files.
>> You could probably implement your own custom processor that does both
>> because you can make assumptions about how you are going to use it, but if
>> the NiFi community provides one then it needs to work well for all
>> situations, such as dynamically listing hundreds of directories, which is
>> problematic when state is involved.
>> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <>
>> wrote:
>>> Do we really need optional state-saving functionality? If
>>> the user is unaware of the implications and proceeds to store the state, then
>>> what Andrew Grande mentioned will happen: the possibility of a never-ending
>>> stream of state information being stored. If we still go with the optional
>>> state management approach, the documentation has to be clear in explaining the
>>> implications.
>>> Sivaprasanna
>>> On Thu, 29 Mar 2018 at 9:28 AM, scott <> wrote:
>>>> Okay. So, a new processor called "ScanSFTP", allowing an incoming relationship,
>>>> where the content of the flow file is replaced with the list of matching
>>>> files from the remote directory, then the list is filtered by the usual
>>>> regex parameters like today. Optional state information is kept to
>>>> additionally filter out files older than the newest file
>>>> observed during the last run. Does that sound okay to everyone? If so,
>>>> what's the next step?
>>>> Scott
>>>> On 03/27/2018 06:21 PM, scott wrote:
>>>>> This is a great discussion, and I appreciate the interest in my problem.
>>>>> I think there are workarounds if you decide not to store state, but
>>>>> I'd recommend keeping it. I think state should be kept optionally,
>>>>> even turned off by default. Several times I've had issues where the
>>>>> state has caused me to miss files, because files get moved into the
>>>>> source folder out of order, and I've wished I could turn the state
>>>>> feature off.
>>>>> In my current use-case, I would not be frequently, dynamically
>>>>> changing the source directory, though I can see the use-cases where it
>>>>> would be. In my current use-case, I want to use an external database
>>>>> table to control the configuration of all my flows. I do this by first
>>>>> reading the content of the table for this particular flow ID, then
>>>>> assign the result as attributes to the flowfile, essentially creating
>>>>> variables I can use throughout the flow to control its behavior. This
>>>>> works great with flows that initiate with HTTP or SQL, but not
>>>>> ListSFTP or ListFile.
>>>>> Scott
>>>>> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>>>>>> I think Bryan’s point is a good one, and when I first saw this
>>>>>> question (and thought of the previous times it’s been asked), my
>>>>>> initial response was to propose a second processor.
>>>>>> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP”, which operates
>>>>>> differently from ListSFTP: it does not maintain state, and performs
>>>>>> a one-time tabulation/chronicling of the state of that directory at
>>>>>> the given point in time.
>>>>>> The responsibility to maintain and compare state across time is no
>>>>>> longer a requirement. There could even be a setting in the processor
>>>>>> to allow for “individual flowfile output” (i.e. act the same as
>>>>>> ListSFTP and output one flowfile per item listed) or “summary
>>>>>> flowfile output” where a single flowfile is generated containing
>>>>>> directory listing information for all the items there. (Another
>>>>>> option is to output both on two different relationships).
>>>>>> I think this would enable the types of workflows that users have
>>>>>> asked about in the past without compromising the mechanism by which
>>>>>> List* processors work and adding undue complexity to those processors.
>>>>>> Absolutely crystal clear documentation (and a standard verb for the
>>>>>> new processor family) would be necessary (not only because these
>>>>>> processors solve different problems, but to avoid a million variants
>>>>>> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
>>>>>> provide a directory in an attribute to ListSFTP” mailing list
>>>>>> questions).
>>>>>> Andy LoPresto
>>>>>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>>>>>>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <> wrote:
>>>>>>> The key here is that a ListXXX processor maintains state. A directory
>>>>>>> is part of such state. Allowing arbitrary directories via an expression
>>>>>>> would create a never-ending stream of new entries in the state storage,
>>>>>>> effectively engineering a distributed DoS attack on the NiFi node or
>>>>>>> shared quorum (for when state is stored in there).
>>>>>>> Maybe if we focus on thinking about assumptions and restrictions the
>>>>>>> processor should make to contain that risk...
>>>>>>> Andrew
>>>>>>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <> wrote:
>>>>>>>> I'm not sure that would solve the problem because you'd still be
>>>>>>>> limited to one directory. What most people are asking for is the
>>>>>>>> ability to use a dynamic directory from an incoming flow file.
>>>>>>>> I think we might be trying to fit two different use-cases into one
>>>>>>>> processor, which might not make sense.
>>>>>>>> Scenario #1... There is a directory that is constantly receiving
>>>>>>>> data and has a significant amount of files, and I want to periodically
>>>>>>>> find new files. This is what the current processors are optimized for.
>>>>>>>> Scenario #2... There is a directory that is mostly static with a
>>>>>>>> moderate/small number of files, and at points in my flow I want to
>>>>>>>> dynamically perform a listing of this directory and retrieve the
>>>>>>>> files. This is more geared towards the mentality of running a
>>>>>>>> job/workflow.
>>>>>>>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <> wrote:
>>>>>>>>> What if the changes were ‘on top of’ some base set of properties,
>>>>>>>>> like directory?
>>>>>>>>> Like a filter which, if present in the incoming flow file, will have
>>>>>>>>> the LIST* processor list only things that match a name or attribute?
>>>>>>>>> On March 27, 2018 at 00:08:41, Joe Witt (<>) wrote:
>>>>>>>>> Scott
>>>>>>>>> This idea has come up a couple of times and there is
>>>>>>>>> something intriguing to it. Where I think this idea stalls, though,
>>>>>>>>> is in implementation.
>>>>>>>>> While I agree that the other List* processors might similarly benefit,
>>>>>>>>> let's focus on ListFile. Today you tell ListFile what directory to
>>>>>>>>> start looking for files in. It goes off scanning that directory for
>>>>>>>>> hits and stores state about what it has already searched/seen. It
>>>>>>>>> is important to keep track of how much it has already searched because
>>>>>>>>> at times the search directory can be massive (hundreds of thousands
>>>>>>>>> or more files and directories to scan, for example).
>>>>>>>>> In the proposed model the directory to be scanned could be provided
>>>>>>>>> dynamically by looking at an attribute of an incoming flowfile (or
>>>>>>>>> other criteria can be provided - not just the directory to scan). In
>>>>>>>>> this case the ListFile processor goes on scanning against that now.
>>>>>>>>> What about the previous directory (or directories) it was told to
>>>>>>>>> scan? Does it still track those too? What if it starts scanning the
>>>>>>>>> newly provided directory, hasn't finished pulling all the data or new
>>>>>>>>> data is continually arriving, and it is told to switch to another
>>>>>>>>> directory?
>>>>>>>>> I think if those questions can get solid answers and someone invests
>>>>>>>>> time in creating a PR then this could be pretty powerful. Would be
>>>>>>>>> good to see a written description of the use case(s) for this too.
>>>>>>>>> Thanks
>>>>>>>>> Joe
>>>>>>>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <> wrote:
>>>>>>>>>> Hello Devs,
>>>>>>>>>> I would like to request a feature for a major processor. But before
>>>>>>>>>> I go down the official road, I wanted to ask if anyone thought it
>>>>>>>>>> was a terrible idea or impossible, etc. The request is to add support
>>>>>>>>>> for an incoming relationship to the ListSFTP processor specifically,
>>>>>>>>>> though I could see it added to many of the commonly used head
>>>>>>>>>> processes, such as ListFile.
>>>>>>>>>> I would envision functionality more like InvokeHTTP or ExecuteSQL,
>>>>>>>>>> where an incoming flow file could initiate the action, and the
>>>>>>>>>> attributes in the incoming flow file could be used to configure the
>>>>>>>>>> actions. It's the configuration aspect that most appeals to me,
>>>>>>>>>> because it opens it up to being centrally or dynamically configured.
>>>>>>>>>> Thanks,
>>>>>>>>>> Scott
