nifi-users mailing list archives

From Jeremy Farbota <jfarb...@payoff.com>
Subject Re: NiFi cluster sluggish...fine after VMs rebooted
Date Tue, 12 Sep 2017 18:25:14 GMT
Yes, these VMs all have New Relic installed.

Whenever I have a memory issue, the node seems to go down. We've got a lot
of heap dedicated to these nodes, and we're not seeing memory spikes.
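
For what it's worth, when we want to watch GC on a node directly rather
than through New Relic, something like the following works (the PID is a
placeholder for the NiFi JVM's process id):

    # sample heap occupancy and GC activity every 5 seconds
    jstat -gcutil <nifi-pid> 5000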

When this is happening, we might have 50k flow files system-wide. The thing
is that I've run flows before where I had over a million flow files after a
SplitText and did not have this issue.

The dedicated heap is 26g for each of these nodes. Usage is usually around
3g; it will spike if we get a huge dump from Kafka, but I rarely see it go
over 12g.
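
Roughly speaking, that heap is set in conf/bootstrap.conf with lines like
the following (a sketch with our sizes, not a copy of the exact file):

    # JVM heap settings for the NiFi process
    java.arg.2=-Xms26g
    java.arg.3=-Xmx26g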

Jeremy Farbota
Software Engineer, Data
Payoff, Inc.

jfarbota@payoff.com
(217) 898-8110

On Tue, Sep 12, 2017 at 10:51 AM, Joe Witt <joe.witt@gmail.com> wrote:

> Jeremy
>
> This sounds, then, like it is probably a memory pressure issue.  Are you
> monitoring memory usage/GC behavior?
>
> Do you have SplitText or MergeContent being used quite a bit, by chance?
> It isn't the size of the content but rather the number of flow files in
> memory that can cause sluggish behavior.  Easily worked around, but let's
> learn a bit more about the current state.
>
> Thanks
>
> On Tue, Sep 12, 2017 at 1:38 PM, Jeremy Farbota <jfarbota@payoff.com>
> wrote:
>
>> Hello,
>>
>> We've been having an issue recently where flow files get queued up and
>> sit in the flows (it seems like the whole system is under back-pressure)
>> at seemingly random times. A few weeks back, we had a problem where the
>> provenance repo was not keeping up with the creation of flow files. We
>> switched to a VolatileProvenanceRepository and that issue was resolved
>> [1] (a configuration sketch follows below).
>>
>>
>> Since then, we'll occasionally have the system slow down without any
>> errors or warnings. It seems that NiFi is unable to move flow files
>> through the system after they are created. I know it is happening when I
>> see the total number of flow files spike and a few things queued in odd
>> places (e.g. ReplaceText, UpdateAttribute, etc., where you usually never
>> see anything queued). Eventually I end up with a huge queue on the last
>> step, which usually involves a processor such as PutSQL or PutHDFS.
>>
>>
>> We've found that this issue is resolved by rebooting all of the machines.
>> Is it possible that the volatile provenance repo is somehow still falling
>> behind, and that rebooting flushes it and thereby removes the
>> back-pressure? We're trying to figure out why the reboot helps. We're
>> also trying to understand more about the different situations in which
>> NiFi internally slows down the flows.
>>
>>
>> We see some of the usual errors in the logs (e.g. security expired, minor
>> errors due to a test flow that has known issues), but nothing that
>> suggests the system is applying back-pressure.
>>
>>
>> At this point, I'm going to create a reporting process that alerts DevOps
>> when flow files are heavily queued, so we know when it's happening (a
>> polling sketch follows below).
>>
>>
>> Thanks in advance.
>>
>>
>> [1] http://apache-nifi.1125220.n5.nabble.com/Rate-of-the-data-flow-is-exceeding-the-provenance-recording-rate-Slowing-down-flow-to-accomodate-td9213.html
>>
>>
>>
>> Jeremy Farbota
>> Software Engineer, Data
>> Payoff, Inc.
>>
>> jfarbota@payoff.com
>>
>
>
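
Regarding the switch to the VolatileProvenanceRepository mentioned in the
quoted message: the change amounts to roughly the following lines in
conf/nifi.properties (a sketch; the buffer size shown is the stock default
rather than necessarily our exact value):

    nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
    # number of provenance events held in memory before the oldest are dropped
    nifi.provenance.repository.buffer.size=100000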

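As for the reporting process mentioned above, here is a minimal sketch of
what the polling could look like, assuming the REST API status endpoint
(/nifi-api/flow/status) and placeholder URL, threshold, and notification
hook; a secured cluster would also need authentication, omitted here:

    import requests

    # Endpoint and field names per the NiFi REST API as I understand it
    # (GET /nifi-api/flow/status -> controllerStatus.flowFilesQueued).
    # The URL, threshold, and alert action are placeholders.
    NIFI_STATUS_URL = "https://nifi.example.com:8443/nifi-api/flow/status"
    QUEUE_THRESHOLD = 50000

    def check_queued_flowfiles():
        resp = requests.get(NIFI_STATUS_URL, timeout=10)
        resp.raise_for_status()
        status = resp.json()["controllerStatus"]
        queued = status["flowFilesQueued"]
        if queued > QUEUE_THRESHOLD:
            # swap in a real notification (email, Slack, PagerDuty, ...)
            print("ALERT: {} flow files queued system-wide".format(queued))
        return queued

    if __name__ == "__main__":
        check_queued_flowfiles()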