nifi-users mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: Deadlocks after upgrade from 0.6.1 to 1.1.1
Date Fri, 17 Feb 2017 05:13:52 GMT
Mike

Totally get it.  If you are able to get back into that state on this or
another system, we're highly interested to learn more.  In looking at
the code relevant to your stack trace I'm not quite seeing the trail
just yet.  The problem is definitely with the persistent prov.
Getting the phased thread dumps will help tell more of the story.

Also, can you tell us anything about the volume/mount that the nifi
install, and the provenance repository specifically, sits on?  Any
interesting mount options involving timestamps, etc.?
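
For reference, something along these lines would show what we're after (the
path below is just a placeholder for wherever the provenance repository
actually lives):

# filesystem, free space, and mount options backing the provenance repo
df -h /path/to/nifi/provenance_repository
findmnt -T /path/to/nifi/provenance_repository -o TARGET,SOURCE,FSTYPE,OPTIONS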

No rush of course and glad you're back in business.  But, you've
definitely got our attention :-)

Thanks
Joe

On Fri, Feb 17, 2017 at 12:10 AM, Mikhail Sosonkin <mikhail@synack.com> wrote:
> Joe,
>
> Many thanks for the pointer on the Volatile provenance. It is, indeed, more
> critical for us that the data moves. Before receiving this message, I
> changed the config and restarted. The data started moving which is awesome!
>
> I'm happy to help you debug this issue. Do you need these collections with
> the volatile setting or persistent setting in locked state?
>
> Mike.
>
> On Thu, Feb 16, 2017 at 11:56 PM, Joe Witt <joe.witt@gmail.com> wrote:
>>
>> Mike
>>
>> One more thing...can you please grab a couple more thread dumps for us
>> with 5 to 10 mins between?
>>
>> I don't see a deadlock but do suspect either just crazy slow IO going
>> on or a possible livelock.  The thread dump will help narrow that down
>> a bit.
>>
>> Can you run 'iostat -xmh 20' for a bit (or its equivalent) on the
>> system too please.
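>>
>> If it's easier to script, a rough sketch like this covers both (the pid
>> and file names are placeholders; adjust to your install):
>>
>> # keep iostat running and take a thread dump every 5 minutes
>> NIFI_PID=12345                  # pid of the main NiFi java process (see jps -l)
>> iostat -xmh 20 > iostat.log &
>> for i in 1 2 3; do
>>   jstack "$NIFI_PID" > "threaddump-$(date +%H%M%S).txt"
>>   sleep 300
>> done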
>>
>> Thanks
>> Joe
>>
>> On Thu, Feb 16, 2017 at 11:52 PM, Joe Witt <joe.witt@gmail.com> wrote:
>> > Mike,
>> >
>> > No need for more info.  Heap/GC looks beautiful.
>> >
>> > The thread dump however, shows some problems.  The provenance
>> > repository is locked up.  Numerous threads are sitting here
>> >
>> > at
>> > java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>> > at
>> > org.apache.nifi.provenance.PersistentProvenanceRepository.persistRecord(PersistentProvenanceRepository.java:757)
>> >
>> > This means these are processors committing their sessions and updating
>> > provenance but they're waiting on a readlock to provenance.  This lock
>> > cannot be obtained because a provenance maintenance thread is
>> > attempting to purge old events and cannot make progress.
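>> >
>> > As a quick check on any future dumps, something like this (the dump file
>> > name is a placeholder) shows how many threads are parked there and whether
>> > the maintenance/purge side shows up too:
>> >
>> > # threads stuck persisting provenance records behind the read lock
>> > grep -c 'PersistentProvenanceRepository.persistRecord' threaddump.txt
>> > # the purge/maintenance side of the contention (method name may vary by version)
>> > grep -i 'purge' threaddump.txt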
>> >
>> > I recall us having addressed this so am looking to see when that was
>> > addressed.  If provenance is not critical for you right now you can
>> > swap out the persistent implementation with the volatile provenance
>> > repository.  In nifi.properties change this line
>> >
>> >
>> > nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
>> >
>> > to
>> >
>> >
>> > nifi.provenance.repository.implementation=org.apache.nifi.provenance.VolatileProvenanceRepository
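>> >
>> > If it helps, a one-liner roughly like this makes the same edit (the conf
>> > path is a placeholder; it keeps a .bak copy, and nifi needs a restart
>> > afterwards):
>> >
>> > # swap the persistent provenance implementation for the volatile one
>> > sed -i.bak 's/PersistentProvenanceRepository/VolatileProvenanceRepository/' /path/to/nifi/conf/nifi.properties
>> > grep 'nifi.provenance.repository.implementation' /path/to/nifi/conf/nifi.properties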
>> >
>> > The behavior reminds me of this issue which was fixed in 1.x
>> > https://issues.apache.org/jira/browse/NIFI-2395
>> >
>> > Need to dig into this more...
>> >
>> > Thanks
>> > Joe
>> >
>> > On Thu, Feb 16, 2017 at 11:36 PM, Mikhail Sosonkin <mikhail@synack.com>
>> > wrote:
>> >> Hi Joe,
>> >>
>> >> Thank you for your quick response. The system is currently in the
>> >> deadlock
>> >> state with 10 worker threads spinning. So, I'll gather the info you
>> >> requested.
>> >>
>> >> - The available space on the partition is 223G free of 500G (same as
>> >> was
>> >> available for 0.6.1)
>> >> - java.arg.3=-Xmx4096m in bootstrap.conf
>> >> - thread dump and jstats are here
>> >> https://gist.github.com/nologic/1ac064cb42cc16ca45d6ccd1239ce085
>> >>
>> >> Unfortunately, it's hard to predict when the decay starts and it takes too
>> >> long to monitor the system manually.  However, if, after seeing the attached
>> >> dumps, you still need thread dumps captured while it decays, I can set up a
>> >> timer script.
>> >>
>> >> Let me know if you need any more info.
>> >>
>> >> Thanks,
>> >> Mike.
>> >>
>> >>
>> >> On Thu, Feb 16, 2017 at 9:54 PM, Joe Witt <joe.witt@gmail.com> wrote:
>> >>>
>> >>> Mike,
>> >>>
>> >>> Can you capture a series of thread dumps as the gradual decay occurs
>> >>> and signal at what point they were generated specifically calling out
>> >>> the "now the system is doing nothing" point.  Can you check for space
>> >>> available on the system during these times as well.  Also, please
>> >>> advise on the behavior of the heap/garbage collection.  Often (not
>> >>> always) a gradual decay in performance can suggest an issue with GC as
>> >>> you know.  Can you run something like
>> >>>
>> >>> jstat -gcutil -h5 <pid> 1000
>> >>>
>> >>> And capture those results in these chunks as well.
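>> >>>
>> >>> A rough timer script along these lines (the pid and file names are
>> >>> placeholders) would collect all of it in timestamped chunks:
>> >>>
>> >>> # every ~5 minutes: thread dump, free disk space, and ~30s of GC stats
>> >>> NIFI_PID=12345        # pid of the main NiFi java process (see jps -l)
>> >>> while true; do
>> >>>   TS=$(date +%Y%m%d-%H%M%S)
>> >>>   jstack "$NIFI_PID" > "dump-$TS.txt"
>> >>>   df -h > "df-$TS.txt"
>> >>>   jstat -gcutil -h5 "$NIFI_PID" 1000 30 > "gc-$TS.txt"
>> >>>   sleep 270
>> >>> done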
>> >>>
>> >>> This would give us a pretty good picture of the health of the system
>> >>> and JVM around these times.  It is probably too much info for the mailing
>> >>> list, so feel free to create a JIRA for this and put
>> >>> attachments there or link to gists on github/etc.
>> >>>
>> >>> Pretty confident we can get to the bottom of what you're seeing
>> >>> quickly.
>> >>>
>> >>> Thanks
>> >>> Joe
>> >>>
>> >>> On Thu, Feb 16, 2017 at 9:43 PM, Mikhail Sosonkin <mikhail@synack.com>
>> >>> wrote:
>> >>> > Hello,
>> >>> >
>> >>> > Recently, we've upgraded from 0.6.1 to 1.1.1 and at first everything
>> >>> > was
>> >>> > working well.  However, a few hours later none of the processors were
>> >>> > showing any activity.  Then, I tried restarting nifi, which caused some
>> >>> > flowfiles to get corrupted, as evidenced by exceptions thrown in the
>> >>> > nifi-app.log; however, the processors still showed no activity.  Next, I
>> >>> > stop the service and delete all state (content_repository,
>> >>> > database_repository, flowfile_repository, provenance_repository, work).
>> >>> > Then the processors start working for a few hours (maybe a day) until the
>> >>> > deadlock occurs again.
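>> >>> >
>> >>> > (For reference, the reset is roughly the following; the install path is
>> >>> > just a placeholder for our actual layout:)
>> >>> >
>> >>> > # stop nifi, wipe all repositories/state, and start it back up
>> >>> > /path/to/nifi/bin/nifi.sh stop
>> >>> > rm -rf /path/to/nifi/content_repository /path/to/nifi/database_repository \
>> >>> >        /path/to/nifi/flowfile_repository /path/to/nifi/provenance_repository \
>> >>> >        /path/to/nifi/work
>> >>> > /path/to/nifi/bin/nifi.sh start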
>> >>> >
>> >>> > So, this cycle continues where I have to periodically reset the
>> >>> > service
>> >>> > and
>> >>> > delete the state to get things moving. Obviously, that's not great.
>> >>> > I'll
>> >>> > note that the flow.xml file has been changed by the new version of nifi,
>> >>> > and that I added/removed processors, but 95% of the flow configuration is
>> >>> > the same as
>> >>> > before the upgrade. So, I'm wondering if there is a configuration
>> >>> > setting
>> >>> > that causes these deadlocks.
>> >>> >
>> >>> > What I've been able to observe is that the deadlock is "gradual" in that
>> >>> > my flow usually takes about 4-5 threads to execute.  The deadlock causes
>> >>> > the worker threads to max out at the limit and I'm not even able to stop
>> >>> > any processors or list queues.  I also have not seen this behavior in a
>> >>> > fresh install of Nifi where the flow.xml would start out empty.
>> >>> >
>> >>> > Can you give me some advice on what to do about this?  Would the problem
>> >>> > be resolved if I manually rebuild the flow with the new version of Nifi
>> >>> > (not looking forward to that)?
>> >>> >
>> >>> > Much appreciated.
>> >>> >
>> >>> > Mike.
>> >>> >
>> >>
>
