nifi-users mailing list archives

From "Peter Wicks (pwicks)" <>
Subject RE: Requesting Obscene FlowFile Batch Sizes
Date Wed, 21 Sep 2016 02:39:21 GMT

Thanks for all of the detail; it's been helpful.
I actually ran an experiment this morning where I modified the processor to force it to keep
calling `get` until it had all 1 million FlowFiles. Since I was calling it sequentially, the
framework was able to move files out of swap and into the active queue on each request. I was
able to retrieve them all and process them through, which was great until… NiFi tried to move
them through provenance. At that point NiFi ran out of memory and fell over (stopped responding).
Right before it ran out of memory I received several bulletins saying that provenance was being
written to too quickly and was being throttled.
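For illustration, the repeated-`get` loop described above can be sketched in plain Java. This is a hypothetical stand-in, not the real NiFi API: `get` here mimics a `ProcessSession.get(int)` call whose return is capped at the 10k swap-in limit, so the caller must loop to accumulate the full batch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the NiFi API) of the modification described above:
// keep calling a capped get() until the full requested batch has arrived.
public class BatchedGet {
    // Mirrors the observed behavior: each call returns at most 10k items,
    // the swap-in limit, regardless of how many were requested.
    static final int SWAP_IN_CAP = 10_000;

    static List<Integer> get(ArrayDeque<Integer> queue, int requested) {
        int n = Math.min(requested, Math.min(SWAP_IN_CAP, queue.size()));
        List<Integer> batch = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            batch.add(queue.poll());
        }
        return batch;
    }

    // Loop until the requested batch size is reached or the queue is empty.
    static List<Integer> drain(ArrayDeque<Integer> queue, int target) {
        List<Integer> all = new ArrayList<>();
        while (all.size() < target && !queue.isEmpty()) {
            all.addAll(get(queue, target - all.size()));
        }
        return all;
    }

    public static void main(String[] args) {
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        for (int i = 0; i < 1_000_000; i++) {
            queue.add(i);
        }
        List<Integer> batch = drain(queue, 1_000_000);
        System.out.println(batch.size()); // prints 1000000
    }
}
```

Note that this only illustrates why a single capped call returns 10k while sequential calls eventually retrieve everything; it says nothing about the provenance-driven heap pressure that followed.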

I found another solution to my mass insert and got it up and running. Using a proprietary
Teradata JDBC flag called FastLoadCSV and a new custom processor, I was able to pass a CSV
file to my JDBC driver and get the same result. In this scenario there was just a single
FlowFile, and everything went smoothly.
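For reference, the FastLoadCSV mode mentioned above is selected through the Teradata JDBC connection URL's TYPE parameter. A sketch of such a URL (host and database names are placeholders, and the exact parameter set depends on the driver version):

```
jdbc:teradata://dbhost/TYPE=FASTLOADCSV,DATABASE=mydb
```

With this URL, the driver expects the CSV content to be supplied to a batched INSERT rather than row-by-row parameters, which is why a single FlowFile carrying the whole CSV avoids the per-FlowFile overhead described earlier in the thread.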

Thanks again!

Peter Wicks

From: Bryan Bende []
Sent: Tuesday, September 20, 2016 3:38 PM
Subject: Re: Requesting Obscene FlowFile Batch Sizes


That was my thinking. An easy test might be to bump the threshold up to 100k (increase heap
if needed) and see if it starts grabbing 100k every time.

If it does, then I would think it's swapping related. You'd then need to figure out whether
you really want to get all 1 million in a single batch, and whether there's enough heap to
support that.
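The two knobs mentioned above live in NiFi's conf directory. A sketch of the relevant settings (the values shown are illustrative for this experiment, not general recommendations):

```
# conf/nifi.properties — per-connection swap threshold (default 20000)
nifi.queue.swap.threshold=100000

# conf/bootstrap.conf — JVM heap settings
java.arg.2=-Xms2g
java.arg.3=-Xmx2g
```

Both changes require a NiFi restart to take effect.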


On Tue, Sep 20, 2016 at 5:29 PM, Andy LoPresto wrote:

That’s a good point. Would running with a larger Java heap and higher swap threshold allow
Peter to get larger batches out?

Andy LoPresto
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

On Sep 20, 2016, at 1:41 PM, Bryan Bende wrote:


Does 10k happen to be your swap threshold by any chance (it defaults to 20k, I believe)?

I suspect the behavior you are seeing could be due to the way swapping works, but Mark or
others could probably confirm.

I found this thread where Mark explained how swapping works with a background thread, and
I believe it still works this way:


On Tue, Sep 20, 2016 at 10:22 AM, Peter Wicks (pwicks) wrote:
I’m using JSONToSQL, followed by PutSQL.  I’m using Teradata, which supports a special
JDBC mode called FastLoad, designed for a minimum of 100,000 rows of data per batch.

What I’m finding is that when PutSQL requests a new batch of FlowFiles from the queue, which
has over 1 million FlowFiles in it, with a batch size of 1000000, it always returns a maximum
of 10k. How can I get my obscenely sized batch request to return all the FlowFiles I’m
asking for?

