nifi-users mailing list archives

From Koji Kawamura <ijokaruma...@gmail.com>
Subject Re: Safeguarding against List/Fetch of partial files
Date Tue, 18 Jul 2017 01:34:41 GMT
Forgot to put a link to the implementation:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ListFile.java#L319

MIN_AGE is not the only such property. ListFile resets its state if any
of the following properties is changed. So, after reconfiguring any of
them, you may get the same files listed again.

return DIRECTORY.equals(property)
|| RECURSE.equals(property)
|| FILE_FILTER.equals(property)
|| PATH_FILTER.equals(property)
|| MIN_AGE.equals(property)
|| MAX_AGE.equals(property)
|| MIN_SIZE.equals(property)
|| MAX_SIZE.equals(property)
|| IGNORE_HIDDEN_FILES.equals(property);
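
To make the effect concrete, here is a minimal sketch of why a state
reset leads to files being listed again. It is a simplified model, not
NiFi's code: the class ListingStateSketch, its methods, and the string
property names are hypothetical, and the real ListFile tracks more than
a single timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of a stateful lister; not NiFi's actual classes.
final class ListingStateSketch {
    // Properties whose change clears the stored state (names mirror the list above).
    private static final Set<String> RESET_PROPERTIES = Set.of(
            "DIRECTORY", "RECURSE", "FILE_FILTER", "PATH_FILTER",
            "MIN_AGE", "MAX_AGE", "MIN_SIZE", "MAX_SIZE", "IGNORE_HIDDEN_FILES");

    // Newest modification time emitted so far; -1 means nothing has been listed yet.
    private long latestListedTimestamp = -1L;

    // Called when a property value changes; forgets progress for reset-triggering properties.
    void onPropertyModified(String propertyName) {
        if (RESET_PROPERTIES.contains(propertyName)) {
            latestListedTimestamp = -1L;
        }
    }

    // Returns the names of entries newer than the stored timestamp, then advances it.
    List<String> list(Map<String, Long> modifiedTimesByName) {
        List<String> listed = new ArrayList<>();
        long newest = latestListedTimestamp;
        for (Map.Entry<String, Long> entry : modifiedTimesByName.entrySet()) {
            if (entry.getValue() > latestListedTimestamp) {
                listed.add(entry.getKey());
                newest = Math.max(newest, entry.getValue());
            }
        }
        latestListedTimestamp = newest;
        return listed;
    }
}
```

In this model, two consecutive list() calls over the same files return
the files once and then nothing. Calling onPropertyModified("MIN_AGE")
in between makes the second call return all of them again, which is the
kind of flood James described.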

On Tue, Jul 18, 2017 at 10:32 AM, Koji Kawamura <ijokarumawak@gmail.com> wrote:
> Hi James, Pierre,
>
> ListFile resets its state (including what is the latest entry it
> listed) when min file age is changed. ListFile.isListingResetNecessary
> implements the behavior.
>
> Thanks,
> Koji
>
> On Tue, Jul 18, 2017 at 2:42 AM, Pierre Villard
> <pierre.villard.fr@gmail.com> wrote:
>> Hi James,
>>
>> This parameter should not change the behavior of the processor regarding
>> files already listed in previous trigger executions of the processor. Could
>> it be possible that old files have been somehow modified by another process?
>> That would explain why the processor listed the files one more time. If you
>> can reproduce the issue, that's certainly a bug IMO.
>>
>> Thanks
>> Pierre
>>
>> 2017-07-14 18:13 GMT+02:00 James McMahon <jsmcmahon3@gmail.com>:
>>>
>>> Joe, I have a follow-up question. If I set Minimum File Age to 60 sec in
>>> my ListFile processor, that does not override the typical behavior in which
>>> ListFile does not include in its list output any file preceding its previous
>>> run cycle, does it? I ask because I set Minimum File Age to 60 sec and have
>>> seen a flood of additional files. Many of those files have date stamps
>>> preceding the ListFile runs that have been executing over the course of the
>>> last few days. I am trying to determine why this might be the case.
>>>
>>> Thanks for any thoughts or insights.
>>>
>>> On Fri, Jul 14, 2017 at 10:11 AM, James McMahon <jsmcmahon3@gmail.com>
>>> wrote:
>>>>
>>>> Thank you very much Joe. This is very good to know. We are indeed working
>>>> with CentOS, and so I can explore with my users using a '.' prefix while
>>>> working with the file, and renaming it when done.
>>>>
>>>> But I'm not certain I can levy that requirement on my users, or depend on
>>>> them to always enforce it. So in combination with that I will use a Minimum
>>>> File Age of 30 or 60 seconds in my ListFile processors. That should be more
>>>> than ample margin, and since my ListFile runs with the default Yield
>>>> duration of 1 sec, the files will be picked up in a subsequent processor
>>>> run quite rapidly. Thanks again.
>>>>
>>>> On Fri, Jul 14, 2017 at 9:53 AM, Joe Witt <joe.witt@gmail.com> wrote:
>>>>>
>>>>> Jim,
>>>>>
>>>>> Ultimately this comes down to whether any consuming process (not just
>>>>> NiFi) can reliably know that a given file is 'ready to be consumed'.
>>>>> If the writer of those files offers no 'protocol' by which you can
>>>>> know then unfortunately it is about 'having a pretty good guess' that
>>>>> they're done.
>>>>>
>>>>> One of the simpler and more reliable ways to know the file writer is
>>>>> done is that the file writer changes the name of the file when it is
>>>>> done.  Most common pattern here is they write the file with a name
>>>>> prepended with a 'dot'.  In *nix this is often considered a 'hidden'
>>>>> file.  The ListFile processor ignores hidden files by default.
>>>>>
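
On the writer side, the dot-prefix convention described above could look
roughly like the following. This is only a sketch under assumptions: the
class DotRenameWriter and its method are hypothetical names, and it
assumes the hidden and final names live in the same directory so the
rename stays on one file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical writer-side helper; not part of NiFi.
final class DotRenameWriter {
    // Writes content to ".<finalName>" first, then renames it to "<finalName>".
    static Path writeThenRename(Path directory, String finalName, String content) {
        Path hidden = directory.resolve("." + finalName);
        Path visible = directory.resolve(finalName);
        try {
            // While this write is in progress the file is hidden from the lister.
            Files.write(hidden, content.getBytes(StandardCharsets.UTF_8));
            // ATOMIC_MOVE asks for a single atomic rename; it throws
            // AtomicMoveNotSupportedException if the file system cannot do that.
            return Files.move(hidden, visible, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A consumer that skips hidden files never sees the file until the rename
has happened, so it cannot pick up a half-written copy.
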
>>>>> After that it is a set of less awesome options.  The next most
>>>>> reliable option most likely is to use the file age (based on
>>>>> modification time) and ListFile makes this available to you.  The
>>>>> problem with this is that you're not guaranteed it will be updated and
>>>>> administrators of systems can disable updates to modification time if
>>>>> they want to.  However, if in your case this is a reliable option you
>>>>> could use that.
>>>>>
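
The file-age idea boils down to a comparison like the one below. It is a
sketch only, with a hypothetical class and method name, and it does not
claim to match how ListFile evaluates Minimum File Age internally.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper illustrating a minimum-age check; not NiFi's implementation.
final class MinAgeCheck {
    // True when the last modification happened at least minAge before now.
    static boolean isOldEnough(Instant lastModified, Instant now, Duration minAge) {
        return Duration.between(lastModified, now).compareTo(minAge) >= 0;
    }
}
```

For a real file, lastModified could come from
Files.getLastModifiedTime(path).toInstant(). As noted above, the check
is only as trustworthy as the modification time itself.
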
>>>>> We could also add something to the processor to make listings slower
>>>>> whereby it would scan a couple times to see if the file size is still
>>>>> changing.  But this is also not very reliable.
>>>>>
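
For illustration, the "scan a couple of times" idea might be sketched as
follows. This is not an existing ListFile feature; the class, the method,
and the LongSupplier-based size probe are all hypothetical choices made
for the example.

```java
import java.util.function.LongSupplier;

// Hypothetical size-stability probe; not an existing NiFi feature.
final class SizeStabilityCheck {
    // Samples the size `samples` times, pausing between samples, and
    // reports whether every sample matched the previous one.
    static boolean isSizeStable(LongSupplier sizeProbe, int samples, long pauseMillis) {
        long previous = sizeProbe.getAsLong();
        for (int i = 1; i < samples; i++) {
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // treat an interrupted check as "not known to be stable"
            }
            long current = sizeProbe.getAsLong();
            if (current != previous) {
                return false;
            }
            previous = current;
        }
        return true;
    }
}
```

A probe for a real file could be () -> path.toFile().length(). A writer
that simply pauses longer than the sampling window still passes this
check, which is why it remains a guess.
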
>>>>> In short, the processor gives you options to handle this but you also
>>>>> have to keep in mind that unless there is some reliable 'protocol'
>>>>> here you are basically guessing at whether the file is ready.  This is
>>>>> a 'how file IO works' thing more than a 'what NiFi can do' thing.
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>>
>>>>> On Fri, Jul 14, 2017 at 9:39 AM, James McMahon <jsmcmahon3@gmail.com>
>>>>> wrote:
>>>>> > A fundamental question was asked by one of the consumers who depend
>>>>> > on my NiFi workflows for transfer of critical data. I wasn't entirely
>>>>> > certain of the answer and feel that I really should better understand
>>>>> > this.
>>>>> >
>>>>> > When using a ListFile/FetchFile combo or even a simple GetFile
>>>>> > processor, how does NiFi ensure that it does not ingest from a
>>>>> > targeted directory any files that an external process may still be
>>>>> > writing to or editing? Is it bound by file locks that have been
>>>>> > established at the system level by those external processes?
>>>>> >
>>>>> > Thanks in advance for any insights that help me better explain this
>>>>> > to consumers of my NiFi workflows.  -Jim
>>>>
>>>>
>>>
>>
