nifi-dev mailing list archives

From Lars Winderling <lars.winderl...@posteo.de>
Subject Re: [EXT] Re: Duplicate flow files *without* their content
Date Thu, 01 Aug 2019 07:42:20 GMT
Hi Peter,
took me some time to understand your suggestion. Great, thank you!
Have a great day and take care.
Best,
Lars
On Wed, 2019-07-31 at 17:53 +0000, Peter Wicks (pwicks) wrote:
> Lars,
> If you are worried about it, using ReplaceText will have the same effect as your custom solution.
> When ReplaceText has its `Replacement Strategy` set to `Always Replace`, it doesn't read the
> contents of the FlowFile and simply writes out the `Replacement Value`, which in your case
> could be an empty string.
> Thanks,
> Peter
> From: Lars Winderling <lars.winderling@posteo.de>
> Sent: Wednesday, July 31, 2019 11:02 AM
> To: dev@nifi.apache.org
> Subject: [EXT] Re: Duplicate flow files *without* their content
> Hi Edward,
> thank you for your input. I didn't know about the copy-on-write semantics; that's really useful.
> I'll check out the in-depth guide for sure!
> In my case, the content of the flow file changes heavily from one processor to the next, so I
> doubt copy-on-write would help here.
> Best,
> Lars
> On Wed, 2019-07-31 at 12:13 +0100, Edward Armes wrote:
> Hi Lars,
> 
> In short, depending on how a FlowFile is duplicated, the content
> shouldn't be duplicated as well.
> 
> 
> In general, content is only duplicated when it has been deemed to have been
> changed (copy-on-write semantics). Unless a FlowFile has a large number of
> attributes, the FlowFile itself is quite small, so the waste from duplicating
> it is minimal. That is why FlowFiles can be held in memory and passed
> through a flow.
> 
> 
> The best way to branch/clone a FlowFile is to add another output from the
> processor whose output you want to log; the framework that surrounds
> a Processor will handle the rest. This does create a duplicate FlowFile but
> doesn't create a copy of the content. In the provenance repository this is
> marked as a CLONE event for the original FlowFile, and the new FlowFile is
> treated as its own unique FlowFile with a reference to the original
> content.
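
The clone behaviour described above can be modelled in a few lines of plain Java. This is only a sketch of the idea: `ContentClaim`, `ModelFlowFile` and `CowDemo` are hypothetical names made up for illustration and are not NiFi's real classes. It assumes a FlowFile is an attribute map plus a pointer to immutable content, so a clone copies the small map and shares the pointer, and a write produces a new claim instead of touching the old one.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model, not NiFi's real classes: an immutable blob of content
// that several FlowFiles may point to.
final class ContentClaim {
    final byte[] bytes;
    ContentClaim(byte[] bytes) { this.bytes = bytes; }
}

// Hypothetical model of a FlowFile: attributes plus a pointer to content.
final class ModelFlowFile {
    final Map<String, String> attributes;
    final ContentClaim claim;

    ModelFlowFile(Map<String, String> attributes, ContentClaim claim) {
        this.attributes = new HashMap<>(attributes);
        this.claim = claim;
    }

    // Clone: copy the (small) attribute map, share the (large) content claim.
    ModelFlowFile cloneSharingContent() {
        return new ModelFlowFile(attributes, claim);
    }

    // Write: the old claim is left untouched; the result points at a new one.
    ModelFlowFile withContent(byte[] newBytes) {
        return new ModelFlowFile(attributes, new ContentClaim(newBytes));
    }
}

final class CowDemo {
    public static void main(String[] args) {
        ContentClaim big = new ContentClaim(
                "pretend this is 10 GB".getBytes(StandardCharsets.UTF_8));
        ModelFlowFile original = new ModelFlowFile(Map.of("filename", "dump.gz"), big);

        ModelFlowFile clone = original.cloneSharingContent();
        System.out.println(clone.claim == original.claim);   // true: content is shared

        ModelFlowFile emptied = clone.withContent(new byte[0]);
        System.out.println(emptied.claim == original.claim); // false: new claim after a write
    }
}
```

In this model the cost of a clone is proportional to the number of attributes, not to the size of the content, which is the point made above.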
> 
> 
> This is quite a short explanation; a better and more in-depth
> explanation, which I think covers all the scenarios you're thinking about,
> can be found here:
> https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
> 
> 
> 
> 
> Edward
> 
> 
> On Wed, Jul 31, 2019 at 11:47 AM Lars Winderling <lars.winderling@posteo.de> wrote:
> 
> 
> Dear NiFi community,
> 
> 
> I often face the use case where I import flow files with content on the
> order of 1 GB or 10 GB, already compressed.
> Let's say I need to branch off of a flow where the actual flow file should
> be processed further, and on some side branch I just want to do some kind
> of logging or whatever without accessing the flow file's contents. Thus
> it's clearly wasteful to duplicate the flow file including content.
> For this case I wrote a processor defining 2 relationships: "original" and
> "attributes only", so the flow file attributes can be accessed separately
> from the content.
> I will gladly prepare a PR if anyone finds that worth incorporating into
> NiFi.
> The only remaining question for me would be: use an individual processor to
> that end, or add it to e.g. the DuplicateFlowFile processor? The former
> seems cleaner to me. Proposed names would be something like ForkProcessor
> (no better idea yet).
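
The routing proposed above, with the two relationships "original" and "attributes only", can be sketched as a plain-Java model. Everything here is hypothetical and for illustration only: `ForkSketch`, `Item` and `fork` are invented names, not NiFi's API and not the processor actually written for this proposal. In a real processor, as far as I know, `ProcessSession.create(FlowFile parent)` yields a child that inherits the parent's attributes and has no content, which would play the role of the attribute-only item below.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of the proposed routing; every name here is hypothetical
// and none of it is NiFi's actual API.
final class ForkSketch {

    // Stand-in for a FlowFile: attributes plus a reference to its content.
    static final class Item {
        final Map<String, String> attributes;
        final byte[] content;

        Item(Map<String, String> attributes, byte[] content) {
            this.attributes = new HashMap<>(attributes);
            this.content = content;
        }
    }

    static final byte[] EMPTY = new byte[0];

    // Route the input unchanged to "original", and an attribute-only child,
    // built without reading or copying the parent's content, to "attributes only".
    static Map<String, Item> fork(Item input) {
        Map<String, Item> routed = new LinkedHashMap<>();
        routed.put("original", input);
        routed.put("attributes only", new Item(input.attributes, EMPTY));
        return routed;
    }

    public static void main(String[] args) {
        Item in = new Item(Map.of("filename", "dump.gz"), new byte[1024]);
        Map<String, Item> out = fork(in);
        System.out.println(out.get("original").content.length);        // 1024
        System.out.println(out.get("attributes only").content.length); // 0
    }
}
```

The side branch can then log or inspect attributes freely, while the large content only ever travels along the "original" relationship.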
> 
> 
> Thanks in advance!
> Best,
> Lars
> 
