nifi-dev mailing list archives

From Bryan Bende <bbe...@gmail.com>
Subject Re: FetchSFTP vs GetSFTP
Date Tue, 31 Oct 2017 20:26:24 GMT
Ryan,

The 10 seconds appears to be a hard-coded rule in the processor,
although it seems like it could be turned into a configurable
property.
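
As a rough illustration of what a configurable idle timeout could look
like, here is a minimal sketch in plain Java. It is not the FetchSFTP
code: the class name IdleExpiringPool, its methods, and the injected
clock are all hypothetical, and a real pool would also close expired
connections.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical idle-expiring pool; an assumption, not the FetchSFTP implementation. */
class IdleExpiringPool<C> {

    /** A pooled connection plus the time it was handed back. */
    private static final class Idle<T> {
        final T connection;
        final long returnedAtMillis;

        Idle(T connection, long returnedAtMillis) {
            this.connection = connection;
            this.returnedAtMillis = returnedAtMillis;
        }
    }

    private final Deque<Idle<C>> idle = new ArrayDeque<>();
    private final Supplier<C> factory;
    private final long maxIdleMillis; // the value that is fixed at 10 seconds today
    private final LongSupplier clock; // injected so the timeout can be tested
    private int created = 0;

    IdleExpiringPool(Supplier<C> factory, long maxIdleMillis, LongSupplier clock) {
        this.factory = factory;
        this.maxIdleMillis = maxIdleMillis;
        this.clock = clock;
    }

    /** Reuses an idle connection if it is still fresh, otherwise creates a new one. */
    synchronized C borrow() {
        long now = clock.getAsLong();
        while (!idle.isEmpty()) {
            Idle<C> candidate = idle.pop();
            if (now - candidate.returnedAtMillis <= maxIdleMillis) {
                return candidate.connection;
            }
            // Expired: a real pool would close the connection here.
        }
        created++;
        return factory.get();
    }

    /** Returns a connection to the pool and records when it became idle. */
    synchronized void giveBack(C connection) {
        idle.push(new Idle<>(connection, clock.getAsLong()));
    }

    synchronized int createdCount() {
        return created;
    }
}
```

With the timeout passed in through the constructor, the value could
come from a processor property instead of a constant.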

It would require a code change to make it grab a batch of flow files
during a single execution. In theory it shouldn't make that much of
a difference, but it might be an interesting experiment. It does make
the code more challenging to write, though that's not a reason not to
do it.
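
To show where the extra complexity comes from, here is a small sketch
of batch handling in plain Java. It is not NiFi code: BatchFetchSketch,
Result, and runOnce are hypothetical names, a queue of paths stands in
for the incoming flow files, and the fetcher function stands in for one
shared SFTP connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

/** Hypothetical batch-fetch sketch; an assumption, not the FetchSFTP implementation. */
class BatchFetchSketch {

    /** Outcome of one execution: paths that were fetched and paths that failed. */
    static final class Result {
        final List<String> fetched = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
    }

    /**
     * Takes up to batchSize paths from the queue and fetches each one through
     * the same fetcher, which stands in for a single shared connection.
     */
    static Result runOnce(Queue<String> incoming, int batchSize,
                          Function<String, byte[]> fetcher) {
        Result result = new Result();
        for (int i = 0; i < batchSize; i++) {
            String path = incoming.poll();
            if (path == null) {
                break; // queue drained before the batch filled up
            }
            try {
                fetcher.apply(path);
                result.fetched.add(path);
            } catch (RuntimeException e) {
                // One bad file must not fail the rest of the batch,
                // so each path needs its own success/failure routing.
                result.failed.add(path);
            }
        }
        return result;
    }
}
```

The per-file try/catch is the part that gets harder with batches: each
file in the batch has to be routed to success or failure on its own.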

If you have a 5 node cluster, you are doing List on primary node and
then redistributing the results to all the nodes via an RPG so all
nodes can fetch?

-Bryan


On Tue, Oct 31, 2017 at 3:43 PM, Ryan Ward <ryan.ward2@gmail.com> wrote:
> Joe/Bryan, thanks!
>
> I believe the one specific file per concurrent task/connection (and too
> many threads) is the issue I have. We have a lot of small files and are
> often backed up. I'm going to drop the task count to take advantage of the
> pooling. Is it possible to have Fetch do batches vs a single file? Would
> that improve throughput? Also, is that 10 seconds configurable?
>
> Some background: I'm converting 2 single nodes into a 5 node cluster and
> trying to figure out the best approach.
>
> Thanks again!
>
>
>
> On Tue, Oct 31, 2017 at 2:56 PM, Bryan Bende <bbende@gmail.com> wrote:
>
>> Ryan,
>>
>> Personally I don't have experience running these processors at scale,
>> but from a code perspective they are fundamentally different...
>>
>> GetSFTP is a source processor, meaning it is not being fed by an upstream
>> connection, so when it executes it can create a connection and
>> retrieve up to max-selects during that one execution.
>>
>> FetchSFTP is being told to fetch one specific file, typically through
>> attributes on incoming flow files, so the concept of max-selects
>> doesn't really apply because there is only one thing to select during an
>> execution of the processor.
>>
>> FetchSFTP does employ connection pooling behind the scenes such that
>> it will keep open a connection for each concurrent task, as long as
>> each connection continues to be used within 10 seconds.
>>
>> -Bryan
>>
>>
>> On Tue, Oct 31, 2017 at 11:43 AM, Joe Witt <joe.witt@gmail.com> wrote:
>> > Ryan - don't know the code specifics behind FetchSFTP off-hand but I
>> > can confirm there are users at that range for it.
>> >
>> > Thanks
>> >
>> > On Tue, Oct 31, 2017 at 11:38 AM, Ryan Ward <ryan.ward2@gmail.com> wrote:
>> >> I've found that on a single node getSFTP is able to pull more files off a
>> >> remote server than Fetch in a cluster. I noticed Fetch doesn't have a max
>> >> selects so it is requiring way more connections (one per file?) and
>> >> concurrent threads to keep up.
>> >>
>> >> Was wondering if anyone is using List/Fetch at scale? In the multi TB's a
>> >> day range?
>> >>
>> >> Thanks,
>> >> Ryan
>>
