nifi-dev mailing list archives

From scott <tcots8...@gmail.com>
Subject Re: ListSFTP incoming relationship
Date Tue, 03 Apr 2018 00:41:55 GMT
Pierre,

That sounds good. I'll work on the requirements and create a Jira this 
week, so that I can get started.

Thanks to all for your feedback.


Scott


On 04/01/2018 10:06 AM, Pierre Villard wrote:
> Hi Scott,
>
> In my opinion, based on the discussion here, I'd suggest you implement
> the solution that you deem best for your needs, taking into
> consideration all the feedback the community provided. Once you have
> something, it is best to submit a pull request so that review and discussion
> can move forward on the implementation itself. I'd also recommend filing a
> JIRA with as much detail as possible on what the need is, what the
> options on the table are, and what implementation you want to propose
> (the more technical details you give, the sooner you'll get feedback on
> your code).
>
> Pierre
>
>
>
> 2018-04-01 18:40 GMT+02:00 scott <tcots8888@gmail.com>:
>
>> Okay. I guess I didn't realize how NiFi dev felt about risk tolerance. I
>> think I can work around it by adding duplicate filtering or implementing
>> some other state management solution.
>> So, what's the next step?
>>
>> Scott
>>
>> On Thu, Mar 29, 2018, 10:46 AM Bryan Bende <bbende@gmail.com> wrote:
>>
>>> Scott,
>>>
>>> You are correct that the overall discussion is about allowing incoming
>>> flow files to ListSFTP.
>>>
>>> However, the previous discussion on this thread highlighted that the
>>> main reason ListSFTP currently doesn't allow incoming flow files is
>>> because of challenges when storing state.
>>>
>>> This led to the proposal of a new processor that allowed incoming flow
>>> files, but did not store state, thus avoiding the challenges mentioned
>>> above. If we were going to store state in this new processor, then
>>> we'd be back to the exact same challenges.
>>>
>>> Providing an option to turn on state also doesn't really help, because
>>> if there is an option provided to users, then the option will be used,
>>> and it needs to work when it is used.
>>>
>>> If we can come up with something that stores state and works well for
>>> all scenarios, then we aren't against it, we just need to handle the
>>> challenges highlighted by Joe's original email.
>>>
>>> Regarding some of the other ideas...
>>>
>>> The current output of ListSFTP already includes flow file attributes
>>> for each listing that include the full path, filename, last update
>>> time, owner, group, permissions, and file size.... were you thinking
>>> of something different than that?
>>>
>>> See the "Writes Attributes" section here:
>>>
>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListSFTP/index.html
>>>
>>> Thanks,
>>>
>>> Bryan
>>>
>>>
>>>
>>> On Thu, Mar 29, 2018 at 12:43 PM, Andy LoPresto <alopresto@apache.org>
>>> wrote:
>>>> Scott,
>>>>
>>>> I think there are two conversations going on here. You are finding the
>>>> requirements for your specific use case, and that’s great. But I echo
>>>> Bryan’s point that a community processor for this scenario should not
>>>> store state at all. Sivaprasanna’s point that given dynamic directory
>>>> input, storing state based on that can cause massive data ingestion
>>>> problems still stands.
>>>>
>>>> For your specific use case, you can prototype (or possibly even get to a
>>>> stable and robust-enough point) using ExecuteScript to model the behavior
>>>> you need.
>>>>
>>>> In regards to the desired output format, I would suggest a few items:
>>>>
>>>> * Avro requires a schema to be defined, and this raises the barrier to
>>>> use of the processor. Also, unless being sent to a processor that
>>>> understands Avro, the result will need to be converted anyway using
>>>> Record* processors.
>>>> * If the output is individual flowfiles on a 1:1 basis, each should have
>>>> as many attributes populated with the parsed information as possible
>>>> (i.e. file.name, file.path, file.size, file.owner, file.permissions,
>>>> etc.). This allows for easily-consumable and routable flowfiles.
>>>> * If the output is a full directory listing, I would suggest `ls -al`
>>>> type raw text output, or JSON (arbitrary human-readable and
>>>> machine-readable format with many consuming/transforming processors).
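
The two output shapes suggested above (one flowfile per entry with attributes, or one summary listing as JSON) could be sketched roughly as follows. This is only an illustration in plain Java: `RemoteEntry` and `ListingOutput` are hypothetical names, not NiFi classes, the attribute keys are simply the ones named in the message, and the JSON escaping is deliberately minimal (a real implementation would likely use a JSON library).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical value type for one remote directory entry; not a NiFi class.
final class RemoteEntry {
    final String name;
    final String path;
    final long size;
    final String owner;
    final String permissions;

    RemoteEntry(String name, String path, long size, String owner, String permissions) {
        this.name = name;
        this.path = path;
        this.size = size;
        this.owner = owner;
        this.permissions = permissions;
    }

    // "Individual flowfile" shape: expose the parsed fields as string attributes.
    // Keys follow the names used in the message above; insertion order is kept.
    Map<String, String> toAttributes() {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("file.name", name);
        attrs.put("file.path", path);
        attrs.put("file.size", Long.toString(size));
        attrs.put("file.owner", owner);
        attrs.put("file.permissions", permissions);
        return attrs;
    }
}

final class ListingOutput {
    // "Summary flowfile" shape: render all entries as one JSON array of objects.
    // All values are emitted as JSON strings to keep the sketch short.
    static String toJson(List<RemoteEntry> entries) {
        return entries.stream()
                .map(e -> e.toAttributes().entrySet().stream()
                        .map(kv -> quote(kv.getKey()) + ":" + quote(kv.getValue()))
                        .collect(Collectors.joining(",", "{", "}")))
                .collect(Collectors.joining(",", "[", "]"));
    }

    // Minimal escaping: backslash and double quote only (assumption: no control characters).
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

With the per-entry shape, downstream routing can work on attributes alone; the summary shape trades that for a single flowfile that a JSON-aware processor has to split or transform.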
>>>>
>>>>
>>>> Andy LoPresto
>>>> alopresto@apache.org
>>>> alopresto.apache@gmail.com
>>>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>>>>
>>>> On Mar 29, 2018, at 9:34 AM, scott <tcots8888@gmail.com> wrote:
>>>>
>>>> Sorry Bryan, but I disagree with you. Not storing state is NOT the main
>>>> point of this new processor. The main point is to allow an incoming
>>>> relationship flowfile to trigger the action, and allow variables to be
>>>> used from the attributes therein.
>>>>
>>>> I agree that if the NiFi community deems it too risky to distribute this
>>>> processor with state keeping optionally available, even if the default
>>>> is to disable it, then so be it. If state is not included optionally,
>>>> then how about making the output flowfile content include more than just
>>>> the file names? Have it include last updated time along with the
>>>> filename. If it searches recursively, you'll want to include the path to
>>>> the file also. Maybe it would be best to output the results into a
>>>> structured format, such as AVRO? Or, maybe it would just be best to
>>>> output one flowfile per remote file found, and include updated time and
>>>> fully qualified path as attributes?
>>>>
>>>> Scott
>>>>
>>>>
>>>> On 03/29/2018 04:32 AM, Bryan Bende wrote:
>>>>
>>>> The main point of the new processor is to NOT store state so that it
>>>> becomes more reasonable to allow incoming flow files.
>>>>
>>>> You could probably implement your own custom processor that does both
>>>> because you can make assumptions about how you are going to use it, but
>>>> if the NiFi community provides one then it needs to work well for all
>>>> situations, such as dynamically listing hundreds of directories, which
>>>> is problematic when state is involved.
>>>>
>>>> On Thu, Mar 29, 2018 at 1:05 AM Sivaprasanna <sivaprasanna246@gmail.com>
>>>> wrote:
>>>>
>>>> Do we really need optional state-saving functionality? If the user is
>>>> unaware of the implications and proceeds to store the state, then what
>>>> Andrew Grande mentioned will happen: the possibility of a never-ending
>>>> stream of state information being stored. If we still go with the
>>>> optional state management approach, the documentation has to be clear in
>>>> explaining the implications.
>>>>
>>>> Sivaprasanna
>>>>
>>>> On Thu, 29 Mar 2018 at 9:28 AM, scott <tcots8888@gmail.com> wrote:
>>>>
>>>> Okay. So, a new processor called "ScanSFTP" would allow an incoming
>>>> relationship, where the content of the flow file is replaced with the
>>>> list of matching files from the remote directory, and the list is then
>>>> filtered by the usual regex parameters like today. Optional state
>>>> information is kept to additionally filter out files older than the
>>>> newest file observed during the last run. Does that sound okay to
>>>> everyone? If so, what's the next step?
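
The filtering described in this proposal (a name regex, plus an optional "newer than the last run" cutoff) might look something like the sketch below. It is plain Java with hypothetical names (`ScanFilter`, `Entry`); it does not use any NiFi API, and treating a file whose timestamp equals the stored one as already seen is an assumption made here, not something the thread specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ScanFilter {
    // Hypothetical listing entry: file name plus last-modified time in epoch millis.
    static final class Entry {
        final String name;
        final long lastModified;

        Entry(String name, long lastModified) {
            this.name = name;
            this.lastModified = lastModified;
        }
    }

    // Keeps entries whose whole name matches the regex. When lastSeenMillis is
    // non-null (state enabled), also drops entries that are not strictly newer
    // than that timestamp. Passing null disables the state-based filter.
    static List<Entry> filter(List<Entry> listing, Pattern nameRegex, Long lastSeenMillis) {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : listing) {
            if (!nameRegex.matcher(e.name).matches()) {
                continue;
            }
            if (lastSeenMillis != null && e.lastModified <= lastSeenMillis) {
                continue;
            }
            kept.add(e);
        }
        return kept;
    }

    // Value to persist for the next run: the newest timestamp observed so far.
    static long newestTimestamp(List<Entry> listing, long previous) {
        long newest = previous;
        for (Entry e : listing) {
            newest = Math.max(newest, e.lastModified);
        }
        return newest;
    }
}
```

Note that a timestamp cutoff like this skips any file that lands in the directory with a modification time older than the stored value, which is the out-of-order problem raised elsewhere in this thread.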
>>>>
>>>> Scott
>>>>
>>>>
>>>> On 03/27/2018 06:21 PM, scott wrote:
>>>>
>>>> This is a great discussion, and I appreciate the interest in my problem.
>>>> I think there are workarounds if you decide not to store state, but
>>>> I'd recommend keeping it. I think state should be kept optionally,
>>>> even turned off by default. Several times I've had issues where the
>>>> state has caused me to miss files, because files get moved into the
>>>> source folder out of order, and I've wished I could turn the state
>>>> feature off.
>>>>
>>>> In my current use-case, I would not be frequently, dynamically
>>>> changing the source directory, though I can see the use-cases where it
>>>> would be. In my current use-case, I want to use an external database
>>>> table to control the configuration of all my flows. I do this by first
>>>> reading the content of the table for this particular flow ID, then
>>>> assigning the result as attributes to the flowfile, essentially creating
>>>> variables I can use throughout the flow to control its behavior. This
>>>> works great with flows that initiate with HTTP or SQL, but not
>>>> ListSFTP or ListFile.
>>>>
>>>> Scott
>>>>
>>>>
>>>> On 03/27/2018 02:05 PM, Andy LoPresto wrote:
>>>>
>>>> I think Bryan’s point is a good one and when I first saw this
>>>> question (and thought of the previous times it’s been asked), my
>>>> initial response was to propose a second processor.
>>>>
>>>> Something like “ScanSFTP”/“IndexSFTP”/“SnapshotSFTP” which operates
>>>> differently from ListSFTP — it does not maintain state, and performs
>>>> a one-time tabulation/chronicling of the state of that directory at
>>>> the given point in time.
>>>>
>>>> The responsibility to maintain and compare state across time is no
>>>> longer a requirement. There could even be a setting in the processor
>>>> to allow for “individual flowfile output” (i.e. act the same as
>>>> ListSFTP and output one flowfile per item listed) or “summary
>>>> flowfile output” where a single flowfile is generated containing the
>>>> directory listing information for all the items there. (Another
>>>> option is to output both on two different relationships).
>>>>
>>>> I think this would enable the types of workflows that users have
>>>> asked about in the past without compromising the mechanism by which
>>>> List* processors work and adding undue complexity to those processors.
>>>>
>>>> Absolutely crystal clear documentation (and a standard verb for the
>>>> new processor family) would be necessary (not only because these
>>>> processors solve different problems, but to avoid a million variants
>>>> of “I used ScanSFTP processor and it’s not tracking state”/“How do I
>>>> provide a directory in an attribute to ListSFTP” mailing list
>>>> questions).
>>>>
>>>>
>>>> Andy LoPresto
>>>> alopresto@apache.org
>>>> alopresto.apache@gmail.com
>>>> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>>>>
>>>> On Mar 27, 2018, at 8:33 AM, Andrew Grande <aperepel@gmail.com> wrote:
>>>>
>>>> The key here is that the ListXXX processor maintains state. A directory
>>>> is part of such state. Allowing arbitrary directories via an expression
>>>> would create a never-ending stream of new entries in the state storage,
>>>> effectively engineering a distributed DoS attack on the NiFi node or
>>>> shared ZK quorum (for when state is stored in there).
>>>>
>>>> Maybe if we focus on thinking about assumptions and restrictions the
>>>> processor should make to contain that risk...
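
One restriction of the kind hinted at here would be to cap how many directories the state may track and evict the least recently used one. The sketch below is only an illustration of that idea, not something proposed in the thread: `BoundedDirectoryState` is a hypothetical in-memory stand-in and does not use NiFi's state management API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for per-directory listing state; not NiFi's state API.
final class BoundedDirectoryState {
    private final Map<String, Long> lastSeenByDirectory;

    BoundedDirectoryState(final int maxDirectories) {
        // accessOrder=true keeps entries ordered from least to most recently used,
        // so removeEldestEntry evicts the directory that was touched longest ago.
        this.lastSeenByDirectory = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > maxDirectories;
            }
        };
    }

    // Stores the newest timestamp observed for a directory, evicting if over the cap.
    void record(String directory, long newestTimestamp) {
        lastSeenByDirectory.put(directory, newestTimestamp);
    }

    // Returns null when nothing is tracked for the directory (never seen, or evicted).
    Long lastSeen(String directory) {
        return lastSeenByDirectory.get(directory);
    }

    int trackedDirectories() {
        return lastSeenByDirectory.size();
    }
}
```

The cap bounds the size of the state, but an evicted directory loses its history, so its next listing would return everything again; whether that trade-off is acceptable depends on the use case.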
>>>>
>>>> Andrew
>>>>
>>>> On Tue, Mar 27, 2018, 9:56 AM Bryan Bende <bbende@gmail.com> wrote:
>>>>
>>>> I'm not sure that would solve the problem because you'd still be
>>>> limited to one directory. What most people are asking for is the
>>>> ability to use a dynamic directory from an incoming flow file.
>>>>
>>>> I think we might be trying to fit two different use-cases into one
>>>> processor which might not make sense.
>>>>
>>>> Scenario #1... There is a directory that is constantly receiving new
>>>> data and has a significant amount of files, and I want to periodically
>>>> find new files. This is what the current processors are optimized for.
>>>>
>>>> Scenario #2... There is a directory that is mostly static with a
>>>> moderate/small number of files, and at points in my flow I want to
>>>> dynamically perform a listing of this directory and retrieve the
>>>> files. This is more geared towards the mentality of running a
>>>> job/workflow.
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 27, 2018 at 9:36 AM, Otto Fowler <ottobackwards@gmail.com>
>>>> wrote:
>>>>
>>>> What if the changes were ‘on top of’ some base set of properties, like
>>>> directory? Like a filter that, if present in the incoming flow file,
>>>> will have the LIST* processor list only things that match a name or
>>>> attribute?
>>>>
>>>>
>>>>
>>>> On March 27, 2018 at 00:08:41, Joe Witt (joe.witt@gmail.com) wrote:
>>>>
>>>> Scott
>>>>
>>>> This idea has come up a couple of times and there is definitely
>>>> something intriguing to it. Where I think this idea stalls out though
>>>> is in implementation.
>>>>
>>>> While I agree that the other List* processors might similarly benefit,
>>>> let's focus on ListFile. Today you tell ListFile what directory to
>>>> start looking for files in. It goes off scanning that directory for
>>>> hits and stores state about what it has already searched/seen. And it
>>>> is important to keep track of how much it has already scanned because
>>>> at times the search directory can be massive (hundreds of thousands or
>>>> more files and directories to scan, for example).
>>>>
>>>> In the proposed model the directory to be scanned could be provided
>>>> dynamically by looking at an attribute of an incoming flowfile (or
>>>> other criteria can be provided - not just the directory to scan). In
>>>> this case the ListFile processor goes on scanning against that now.
>>>> What about the previous directory (or directories) it was told to
>>>> scan? Does it still track those too? What if it starts scanning the
>>>> newly provided directory, hasn't finished pulling all the data or new
>>>> data is continually arriving, and it is told to switch to another
>>>> directory?
>>>>
>>>> I think if those questions can get solid answers and someone invests
>>>> time in creating a PR then this could be pretty powerful. Would be
>>>> good to see a written description of the use case(s) for this too.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Mon, Mar 26, 2018 at 11:58 PM, scott <tcots8888@gmail.com> wrote:
>>>>
>>>> Hello Devs,
>>>>
>>>> I would like to request a feature for a major processor, ListSFTP. But
>>>> before I go down the official road, I wanted to ask if anyone thought it
>>>> was a terrible idea or impossible, etc. The request is to add support
>>>> for an incoming relationship to the ListSFTP processor specifically, but
>>>> I could see it added to many of the commonly used head processors, such
>>>> as ListFile. I would envision functionality more like InvokeHTTP or
>>>> ExecuteSQL, where an incoming flow file could initiate the action, and
>>>> the attributes in the incoming flow file could be used to configure the
>>>> processor actions. It's the configuration aspect that most appeals to
>>>> me, because it opens it up to being centrally or dynamically configured.
>>>>
>>>> Thanks,
>>>>
>>>> Scott
>>>>
>>>>
>>>>
>>>>

