nifi-dev mailing list archives

From Mark Payne <marka...@hotmail.com>
Subject Re: NiFi 0.5.1 "too many open files"
Date Thu, 28 Apr 2016 15:16:56 GMT
Mike,

I have been looking into this, and I was able to determine how we could potentially archive (and
eventually age off from the archive) a piece of data for which there is still an open file handle.
Specifically, as far as I can tell, this can happen only when we hit a problem reading a FlowFile
(such as a ContentNotFoundException) during a call to ProcessSession.write() with a StreamCallback.

So this is exactly the situation that you laid out here. I will be looking into how to address this.
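The failure mode described above can be sketched in plain Java, outside of NiFi. The names below (the `StreamCallback` stand-in, `writeLeaky`, `writeSafe`) are hypothetical illustrations and not NiFi's actual implementation; the point is the shape of the bug: if the output stream for the new content claim is opened before the input read fails, and the failure path never closes it, the descriptor leaks.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LeakOnReadFailure {

    // Hypothetical stand-in for NiFi's StreamCallback interface
    interface StreamCallback {
        void process(InputStream in, OutputStream out) throws IOException;
    }

    // Buggy shape: the output stream is opened first; if opening the input
    // throws (the ContentNotFoundException case), out is never closed.
    static void writeLeaky(Path missing, Path dest, StreamCallback cb) throws IOException {
        OutputStream out = new FileOutputStream(dest.toFile());
        InputStream in = Files.newInputStream(missing);  // throws NoSuchFileException
        cb.process(in, out);
        in.close();
        out.close();  // never reached when the read side fails
    }

    // Safe shape: try-with-resources closes both streams on every exit path.
    static void writeSafe(Path missing, Path dest, StreamCallback cb) throws IOException {
        try (OutputStream out = new FileOutputStream(dest.toFile());
             InputStream in = Files.newInputStream(missing)) {
            cb.process(in, out);
        }
    }
}
```

Repeated failures of the leaky shape accumulate open descriptors until the process hits its ulimit, which matches the "too many open files" symptom reported in this thread.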

Do you have any idea what may have caused your ContentNotFoundException to occur in the first
place?

Thanks
-Mark


> On Apr 27, 2016, at 5:24 PM, Michael Moser <moser.mw@gmail.com> wrote:
> 
> I found something in the logs on the nodes where I had a problem.  A
> ContentNotFoundException begins occurring on these nodes and after many
> thousands of times we eventually get "too many open files".  Once I do
> surgery on the content repository so that ContentNotFoundException stops
> happening, then lsof tells me that I'm not 'leaking' any more (deleted)
> file descriptors.  I have no idea why the ContentNotFoundException starts
> happening, but it sounds similar to Tony Kurc's "heisenbug" [1] reported
> back in March.
> 
> Here's a stack trace of the ContentNotFoundException
> 
> 2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1]
> MyCustomProcessor
> org.apache.nifi.processor.exception.MissingFlowFileException: Unable to
> find content for FlowFile
>  at
> o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
>  at
> o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
>  at MyCustomProcessor.onTrigger()
>  at o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> Caused by: o.a.n.controller.repository.ContentNotFoundException: Could not
> find content for StandardContentClaim
> [resourceClaim=StandardResourceClaim[id=123456789-1, container=default,
> section=1], offset=351144, length=30720]
>  at
> o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
>  at
> o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
>  at
> o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
>  at
> o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
>  ... common frames omitted
> 
> MyCustomProcessor just calls session.write() with a StreamCallback
> implementation that reads from an InputStream and writes to an OutputStream.
> 
> [1] -
> http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html
> 
> -- Mike
> 
> 
> 
> 
> On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt <joe.witt@gmail.com> wrote:
> 
>> Mike
>> 
>> Ok that is a good data point.  In my case they are all in archive but
>> I do agree that isn't super meaningful because in reality nothing
>> should ever be open for writing in the archive.
>> 
>> If you can, and you have enough logging enabled, try searching for that
>> first part of the filename in your logs.  Let's see if we can narrow this
>> down to a particular type of flow or something.
>> 
>> Thanks
>> Joe
>> 
>> On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser <moser.mw@gmail.com>
>> wrote:
>>> Another data point ... we had archiving turned on at first, and then most
>>> (but not all) files that lsof reported were
>>> 
>>> /content_repository/0/archive/123456789-123456 (deleted).
>>> 
>>> We turned archiving off, hoping that was related in some way, but it was
>>> not.
>>> 
>>> -- Mike
>>> 
>>> 
>>> On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <joe.witt@gmail.com> wrote:
>>> 
>>>> Mike,
>>>> 
>>>> Definitely does not sound familiar.  However, I just looked up what you
>>>> describe and I do see it.  In my case there are only three files, but
>>>> they are sitting there open for writing by the nifi process and yet
>>>> have been deleted.  So I do believe there is an issue... I will dig in a
>>>> bit, but obviously if you make more discoveries here please share.
>>>> 
>>>> Thanks
>>>> Joe
>>>> 
>>>> 
>>>> 
>>>> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <moser.mw@gmail.com>
>>>> wrote:
>>>>> Devs,
>>>>> 
>>>>> We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We noticed
>>>>> half of our cluster nodes getting "too many open files" errors that
>>>>> require a NiFi restart, while the other half works without this problem.
>>>>> Using 'lsof -p <pid>' to identify the open file descriptors at the time
>>>>> of the problem, we see most of the file descriptors reference deleted
>>>>> files in the content repository like this:
>>>>>
>>>>> java <pid> <user> <fd> ... /content_repository/81/123456789-123456 (deleted)
>>>>>
>>>>> A 'ls /content_repository/81/123456789-123456' confirms that the file
>>>>> has been deleted.
>>>>>
>>>>> We are continuing our investigation into why some of our nodes have a
>>>>> problem but others don't.  Has anyone else seen this?  Did anything
>>>>> change between 0.4.1 and 0.5.1 related to deleting files from the
>>>>> content repository?
>>>>> 
>>>>> Regards,
>>>>> -- Mike
>>>> 
>> 
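The "(deleted)" descriptors that lsof reports in the quoted messages above can be reproduced outside NiFi on Linux: a file that is unlinked while a process still holds it open keeps its inode (and disk space) alive until the last descriptor closes. A minimal sketch, Linux-only since it reads /proc/self/fd; the class and method names are mine, for illustration:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class DeletedFdDemo {

    // Targets of /proc/self/fd symlinks that point at unlinked-but-open files;
    // the kernel appends " (deleted)" to such targets, which is exactly what
    // 'lsof -p <pid>' shows as "... (deleted)".
    static List<String> deletedFds() throws IOException {
        List<String> result = new ArrayList<>();
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(Paths.get("/proc/self/fd"))) {
            for (Path fd : fds) {
                try {
                    String target = Files.readSymbolicLink(fd).toString();
                    if (target.endsWith("(deleted)")) {
                        result.add(target);
                    }
                } catch (IOException raced) {
                    // descriptor closed between listing and readlink; skip it
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("claim-", ".bin");
        OutputStream out = new FileOutputStream(tmp.toFile());  // hold the file open
        Files.delete(tmp);  // unlink while the descriptor is still open
        for (String target : deletedFds()) {
            System.out.println(target);
        }
        out.close();  // only now does the kernel reclaim the space
    }
}
```

Until the descriptor closes, the unlinked file still consumes disk, which is why files "deleted" from the archive can keep holding space while NiFi has them open for writing.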

