nifi-dev mailing list archives

From Bryan Bende <>
Subject Re: FetchSFTP vs GetSFTP
Date Tue, 31 Oct 2017 20:26:24 GMT

The 10 seconds appears to be a hard-coded rule in the processor,
although it seems like it could be made configurable.
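
To illustrate what a configurable idle timeout could look like, here is a minimal plain-Java sketch. It is not the FetchSFTP code: `IdleConnectionCache`, `borrow`, `put`, and `evictIdle` are hypothetical names, and the caller passes the current time in explicitly so the behaviour is easy to check.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not NiFi code: keeps one connection per key and closes
// any connection that has not been used within a configurable idle limit.
class IdleConnectionCache<C extends AutoCloseable> {

    // A cached connection plus the time it was last handed out.
    private static final class Pooled<C> {
        final C connection;
        long lastUsedNanos;

        Pooled(C connection, long lastUsedNanos) {
            this.connection = connection;
            this.lastUsedNanos = lastUsedNanos;
        }
    }

    private final Map<String, Pooled<C>> pooled = new HashMap<>();
    private final long maxIdleNanos;

    // maxIdleMillis stands in for the value that is fixed at 10 seconds today.
    IdleConnectionCache(long maxIdleMillis) {
        this.maxIdleNanos = TimeUnit.MILLISECONDS.toNanos(maxIdleMillis);
    }

    // Stores a connection under the key, closing any connection it replaces.
    synchronized void put(String key, C connection, long nowNanos) {
        Pooled<C> previous = pooled.put(key, new Pooled<>(connection, nowNanos));
        if (previous != null) {
            closeQuietly(previous.connection);
        }
    }

    // Returns the cached connection and refreshes its last-used time,
    // or returns null when nothing is cached for the key.
    synchronized C borrow(String key, long nowNanos) {
        Pooled<C> entry = pooled.get(key);
        if (entry == null) {
            return null;
        }
        entry.lastUsedNanos = nowNanos;
        return entry.connection;
    }

    // Closes and removes every connection idle for longer than the limit.
    // Returns the number of connections that were evicted.
    synchronized int evictIdle(long nowNanos) {
        int evicted = 0;
        Iterator<Map.Entry<String, Pooled<C>>> it = pooled.entrySet().iterator();
        while (it.hasNext()) {
            Pooled<C> entry = it.next().getValue();
            if (nowNanos - entry.lastUsedNanos > maxIdleNanos) {
                closeQuietly(entry.connection);
                it.remove();
                evicted++;
            }
        }
        return evicted;
    }

    private static void closeQuietly(AutoCloseable connection) {
        try {
            connection.close();
        } catch (Exception e) {
            // Ignored in this sketch; real code would log the failure.
        }
    }
}
```

With the limit passed in through the constructor, a connection that keeps being borrowed stays open, and one left unused past the limit is closed on the next eviction pass.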

It would require a code change to make it grab a batch of flow files
during a single execution. In theory it shouldn't provide that much of
a difference, but might be an interesting experiment. It makes the
code more challenging to write though, not that that's a reason not to
do it.

If you have a 5 node cluster, you are doing List on primary node and
then redistributing the results to all the nodes via an RPG so all
nodes can fetch?


On Tue, Oct 31, 2017 at 3:43 PM, Ryan Ward <> wrote:
> Joe/Bryan Thanks!
> I believe the one specific file per concurrent task/connection (and too
> many threads) is the issue I have. We have a lot of small files and are
> often backed up. I'm going to drop the task count to take advantage of the
> pooling. Is it possible to have Fetch do batches vs a single file? Would
> that improve throughput? Also is that 10 seconds configurable?
> Some background: I'm converting 2 single nodes into a 5 node cluster and
> trying to figure out the best approach.
> Thanks again!
> On Tue, Oct 31, 2017 at 2:56 PM, Bryan Bende <> wrote:
>> Ryan,
>> Personally I don't have experience running these processors at scale,
>> but from a code perspective they are fundamentally different...
>> GetSFTP is a source processor, meaning it is not being fed by an upstream
>> connection, so when it executes it can create a connection and
>> retrieve up to max-selects during that one execution.
>> FetchSFTP is being told to fetch one specific file, typically through
>> attributes on incoming flow files, so the concept of max-selects
>> doesn't really apply because there is only one thing to select during an
>> execution of the processor.
>> FetchSFTP does employ connection pooling behind the scenes such that
>> it will keep open a connection for each concurrent task, as long as
>> each connection continues to be used within 10 seconds.
>> -Bryan
>> On Tue, Oct 31, 2017 at 11:43 AM, Joe Witt <> wrote:
>> > Ryan - don't know the code specifics behind FetchSFTP off-hand but I
>> > can confirm there are users at that range for it.
>> >
>> > Thanks
>> >
>> > On Tue, Oct 31, 2017 at 11:38 AM, Ryan Ward <> wrote:
>> >> I've found that on a single node getSFTP is able to pull more files off a
>> >> remote server than Fetch in a cluster. I noticed Fetch doesn't have a max
>> >> selects so it is requiring way more connections (one per file?) and
>> >> concurrent threads to keep up.
>> >>
>> >> Was wondering if anyone is using List/Fetch at scale? In the multi TB's a
>> >> day range?
>> >>
>> >> Thanks,
>> >> Ryan
