nifi-users mailing list archives

From Sebastian Lagemann <sebast...@wearerealitygames.com>
Subject Re: Options for increasing performance?
Date Fri, 07 Apr 2017 08:20:19 GMT
Jim,

we sustained 2k flowfiles per second on HandleHttpRequest with 50 threads on the processor without issues; the bottleneck was in processors further down the flow and was primarily related to slow disk I/O.

Best,

Seb

> On 06.04.2017 at 12:00, James McMahon <jsmcmahon3@gmail.com> wrote:
> 
> Intriguing. I'm one of those who have employed the "single flowfile" approach. I'm certainly willing to test out this refinement.
> So to press your point: this is more efficient than setting the processor's "Concurrent tasks" to 10 because it incurs the initialization burden for ExecuteScript once, rather than relying on the processor configuration param (which presumably incurs that initialization burden ten times)?
> 
> I currently set "Concurrent tasks" to 50. The logjam I am seeing is not in my ExecuteScript processor. My delay is definitely a non-steady, "non-fast" stream of data at my HandleHttpRequest processor, the first processor in my workflow. Why that is the case is a mystery we've yet to resolve.
> 
> One thing I'd welcome is some idea of what a reasonable expectation is for requests handled by HandleHttpRequest in an hour. Maybe 1500 an hour is low, maybe it's high, or perhaps it's entirely reasonable. We really have little insight. Any empirical data from users' practical experience would be most welcome.
> 
> Also, I added a second HandleHttpRequest fielding requests on a second port. I did not see any improvement in throughput. Why might that be? My expectation was that with two doors open rather than one, I'd see a greater influx of data.
> 
> Thank you.
> - Jim
> 
>> On Wed, Apr 5, 2017 at 4:26 PM, Scott Wagner <swagner@beenverified.com> wrote:
>> One of my experiences when using ExecuteScript with Python is that a script which works on an individual FlowFile when you have multiple FlowFiles in the input queue is very inefficient, even when you set the processor to a timer-driven run schedule of 0 sec.
>> 
>> Instead, I have the following in all of my Python scripts:
>> 
>> flowFiles = session.get(10)  # grab up to 10 FlowFiles from the queue in one call
>> for flowFile in flowFiles:
>>     if flowFile is None:
>>         continue
>>     # Do stuff here
>> 
>> That seems to improve the throughput of the ExecuteScript processor dramatically.
>> 
>> YMMV
>> 
>> - Scott
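
For reference, here is a minimal sketch of how that batched pattern could look as a complete ExecuteScript (Jython) body doing the kind of required-field check discussed in this thread. It assumes the standard session, REL_SUCCESS, and REL_FAILURE bindings that ExecuteScript provides; the field names are placeholders, not the real ones from Jim's flow:

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

REQUIRED_FIELDS = ["field_a", "field_b"]  # hypothetical names; substitute the real required fields

class ReadContent(InputStreamCallback):
    # Collects the FlowFile content as a UTF-8 string.
    def __init__(self):
        self.text = None
    def process(self, inputStream):
        self.text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)

flowFiles = session.get(10)  # pull up to 10 FlowFiles per onTrigger invocation
for flowFile in flowFiles:
    if flowFile is None:
        continue
    reader = ReadContent()
    session.read(flowFile, reader)
    try:
        data = json.loads(reader.text)
        missing = [f for f in REQUIRED_FIELDS if f not in data]
        if missing:
            session.transfer(flowFile, REL_FAILURE)
        else:
            session.transfer(flowFile, REL_SUCCESS)
    except ValueError:
        # Content was not valid JSON
        session.transfer(flowFile, REL_FAILURE)

With this shape, the per-invocation overhead of evaluating the script is paid once for a batch of up to 10 FlowFiles instead of once per FlowFile, which matches the speedup Scott describes.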
>>> James McMahon Wednesday, April 5, 2017 12:48 PM
>>> I am receiving POSTs from a Pentaho process, delivering files to the HandleHttpRequest processor in my NiFi 0.7.x workflow. That processor hands the flowfile off to an ExecuteScript processor that runs a Python script. This script is very, very simple: it takes an incoming JSON object, loads it into a Python dictionary, and verifies the presence of required fields using simple has_key checks on the dictionary. There are only eight fields in the incoming JSON object.
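
The check described above presumably boils down to something like the following sketch. The key names are made up for illustration, and has_key is the Python 2 / Jython idiom mentioned (the expression "name in record" would work equally well):

import json

REQUIRED_FIELDS = ["field_a", "field_b"]  # hypothetical names; the real flow has eight

def all_fields_present(json_text):
    # Load the incoming JSON object into a dict and confirm every required key is present.
    record = json.loads(json_text)
    return all(record.has_key(name) for name in REQUIRED_FIELDS)

A check this small is unlikely to be the bottleneck on its own, which is consistent with Jim's observation elsewhere in the thread that the delay sits at HandleHttpRequest rather than in ExecuteScript.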
>>> 
>>> The throughput for these two processors does not exceed 100-150 files in five minutes. That seems very slow in light of the minimal processing going on in these two steps.
>>> 
>>> I notice that there are configuration options seemingly related to optimizing performance. "Concurrent tasks", for example, is only set by default to 1 for each processor.
>>> 
>>> What performance optimizations at the processor level do users recommend? Is it advisable to crank up the concurrent tasks for a processor, and is there a point beyond which you should not crank up that value? Are there trade-offs?
>>> 
>>> I am particularly interested in optimizations for the HandleHttpRequest and ExecuteScript processors.
>>> 
>>> Thanks in advance for your thoughts.
>>> 
>>> cheers,
>>> 
>>> Jim
>> 
> 
