nifi-dev mailing list archives

From Joe Gresock <jgres...@gmail.com>
Subject Re: Content Repository Cleanup
Date Sat, 10 Dec 2016 11:57:34 GMT
Not sure if your scenario is related, but one of the NiFi devs recently
explained to me that the files in the content repository are actually
appended together with other flow file content (please correct me if I'm
explaining it wrong).  That means if you have many small flow files in your
current backlog, and several large flow files have recently left the flow,
the large ones could still be hanging around in the content repository as
long as the small ones are still there, if they're in the same appended
files on disk.
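
To make the retention effect concrete, here is a toy Python model of it. It is only a sketch under assumptions, not NiFi's actual implementation: the ClaimFile and ToyContentRepo names are made up for illustration, the two limits simply mirror the nifi.content.claim.* values quoted further down the thread, and the rule "a file on disk can be dropped only once nothing queued still points into it" is my reading of the explanation above.

```python
from dataclasses import dataclass, field

# Assumed limits, copied from the nifi.properties quoted in this thread.
MAX_APPENDABLE = 10 * 1024 * 1024   # nifi.content.claim.max.appendable.size=10 MB
MAX_FLOW_FILES = 100                # nifi.content.claim.max.flow.files=100


@dataclass(eq=False)  # compare by identity so list.remove() drops the right file
class ClaimFile:
    """Hypothetical stand-in for one on-disk file shared by several flow files."""
    sizes: dict = field(default_factory=dict)  # flow file id -> bytes appended
    live: set = field(default_factory=set)     # ids still queued in the flow

    def bytes_on_disk(self):
        return sum(self.sizes.values())

    def accepts_more(self):
        # Checked before a write, so a single write may push the file past the limit.
        return (self.bytes_on_disk() < MAX_APPENDABLE
                and len(self.sizes) < MAX_FLOW_FILES)


class ToyContentRepo:
    """Toy repository: content is appended to the newest file that still has room."""

    def __init__(self):
        self.claims = []
        self.location = {}

    def write(self, ff_id, size):
        if not self.claims or not self.claims[-1].accepts_more():
            self.claims.append(ClaimFile())
        claim = self.claims[-1]
        claim.sizes[ff_id] = size
        claim.live.add(ff_id)
        self.location[ff_id] = claim

    def remove(self, ff_id):
        """The flow file leaves the flow; its bytes stay until the whole file is unreferenced."""
        claim = self.location.pop(ff_id)
        claim.live.discard(ff_id)
        if not claim.live:
            self.claims.remove(claim)

    def disk_usage(self):
        return sum(c.bytes_on_disk() for c in self.claims)

    def live_usage(self):
        return sum(c.sizes[i] for c in self.claims for i in c.live)


def demo():
    """Interleave tiny and large flow files, then let only the large ones leave."""
    repo = ToyContentRepo()
    for i in range(10):
        repo.write(f"tiny-{i}", 1024)
        repo.write(f"big-{i}", 5 * 1024 * 1024)
    for i in range(10):
        repo.remove(f"big-{i}")
    return repo.disk_usage(), repo.live_usage()
```

In this model, demo() leaves about 50 MB on disk while the queued flow files account for only 10 KB, which is the same kind of gap between disk usage and reported flow size described in this message.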

This scenario recently happened to us: we had a flow with ~20 million tiny
flow files queued up, and at the same time we were also processing a bunch
of 1GB files, which left the flow quickly.  The content repository was much
larger than what was actually being reported in the flow stats, and our
disks were almost full.  On a hunch, I tried the following strategy:
- MergeContent the tiny flow files using flow-file-v3 format (to capture
all attributes)
- MergeContent 10,000 of the packaged flow files using tar format for
easier storage on disk
- PutFile into a directory
- GetFile from the same directory, but using back pressure from here on out
(so that the flow simply wouldn't pull the same files from disk until it
was really ready for them)
- UnpackContent (untar them)
- UnpackContent (turn them back into flow files with the original
attributes)
- Then do the processing they were originally designed for
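
The pack and unpack steps above can be sketched outside NiFi as a round trip through a tar archive. This is only an illustration, assuming a made-up layout: the pack() and unpack() helpers are hypothetical, and the attributes are stored as JSON purely as a stand-in, since this is not the flow-file-v3 format that MergeContent writes. The point is that every tiny flow file (content plus attributes) gets rewritten into a new, compact container and can be restored intact later.

```python
import io
import json
import tarfile


def pack(flow_files):
    """Bundle (attributes, content) pairs into one in-memory tar archive.

    Each flow file becomes two tar members: N.attrs.json and N.content.
    The JSON sidecar is an assumption for this sketch, not flow-file-v3.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for n, (attrs, content) in enumerate(flow_files):
            members = (
                (f"{n}.attrs.json", json.dumps(attrs).encode("utf-8")),
                (f"{n}.content", content),
            )
            for name, data in members:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack(archive):
    """Inverse of pack(): return the (attributes, content) pairs in order."""
    members = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        for info in tar.getmembers():
            members[info.name] = tar.extractfile(info).read()
    count = len(members) // 2
    return [
        (json.loads(members[f"{n}.attrs.json"]), members[f"{n}.content"])
        for n in range(count)
    ]
```

A quick check is to build a list of small (attributes, content) pairs, call pack() on it, and confirm that unpack() returns the same list. In the real flow the PutFile/GetFile pair sits between the two halves, so the packed data lives outside the content repository until the flow is ready to pull it back in.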

This had the effect of very quickly reducing the size of my content
repository to very nearly the actual size I saw reported in the flow, and
my disk usage dropped from ~95% to 50%, which is the configured content
repository max usage percentage.  I haven't had any problems since.

Hope this helps.
Joe

On Sat, Dec 10, 2016 at 12:04 AM, Joe Witt <joe.witt@gmail.com> wrote:

> Alan,
>
> That retention percentage only has to do with the archive of data
> which kicks in once a given chunk of content is no longer reachable by
> active flowfiles in the flow.  For it to grow to 100% typically would
> mean that you have data backlogged in the flow that accounts for that
> much space.  If that is certainly not the case for you then we need to
> dig deeper.  If you could do screenshots or share log files and stack
> dumps around this time those would all be helpful.  If the screenshots
> and such are too sensitive please just share as much as you can.
>
> Thanks
> Joe
>
> On Fri, Dec 9, 2016 at 9:55 PM, Alan Jackoway <alanj@cloudera.com> wrote:
> > One other note on this, when it came back up there were tons of messages
> > like this:
> >
> > 2016-12-09 18:36:36,244 INFO [main] o.a.n.c.repository.FileSystemRepository
> > Found unknown file /path/to/content_repository/498/1481329796415-87538
> > (1071114 bytes) in File System Repository; archiving file
> >
> > I haven't dug into what that means.
> > Alan
> >
> > On Fri, Dec 9, 2016 at 9:53 PM, Alan Jackoway <alanj@cloudera.com> wrote:
> >
> >> Hello,
> >>
> >> We have a node on which nifi content repository keeps growing to use 100%
> >> of the disk. It's a relatively high-volume process. It chewed through more
> >> than 100GB in the three hours between when we first saw it hit 100% of the
> >> disk and when we just cleaned it up again.
> >>
> >> We are running nifi 1.1 for this. Our nifi.properties looked like this:
> >>
> >> nifi.content.repository.implementation=org.apache.nifi.controller.repository.FileSystemRepository
> >> nifi.content.claim.max.appendable.size=10 MB
> >> nifi.content.claim.max.flow.files=100
> >> nifi.content.repository.directory.default=./content_repository
> >> nifi.content.repository.archive.max.retention.period=12 hours
> >> nifi.content.repository.archive.max.usage.percentage=50%
> >> nifi.content.repository.archive.enabled=true
> >> nifi.content.repository.always.sync=false
> >>
> >> I just bumped retention period down to 2 hours, but should max usage
> >> percentage protect us from using 100% of the disk?
> >>
> >> Unfortunately we didn't get jstacks on either failure. If it hits 100%
> >> again I will make sure to get that.
> >>
> >> Thanks,
> >> Alan
> >>
>



-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*
