nifi-dev mailing list archives

From Joe Skora
Subject Re: Thoughts on NIFI-1847 Improve Provenance Space Utilization
Date Wed, 01 Feb 2017 15:24:52 GMT

The gist of the original ticket, NIFI-1847 Improve Provenance Space
Utilization, was about
efficient use of the configured repository space and support for multiple
asymmetric storage locations.  My email was seeking input on that,
especially in consideration of property changes necessary to describe
discrete storage locations of different sizes.

I think any improvement to the repository performance will be welcomed by a
lot of folks, but I'm a little concerned about a complete rewrite.  Do you
plan to port the new repository back to 0.x?  Without porting it back,
users of 0.x will still have problems.

On a heavy provenance test flow I observed storage thrashing and overrun on
0.x and 1.x, but it seemed to be caused by the cleanup logic, not the
underlying repository implementation.  With minor changes to cleanup
thresholds it ran better and without overrunning storage.  The PRs
submitted in November on NIFI-3039 [2] (PR#1240 [3] and PR#1241 [4])
implemented those cleanup
threshold adjustments.  I know you commented on NIFI-3039, but did you try
the changes before starting a complete rewrite?

The first problem was that the current repository starts cleanup at >90%
used but stops as soon as it drops below 100% used, so it tends to hover
close to capacity, increasing the number of cleanup cycles.  Similarly,
rollover limits itself to 110% of configured space, implying an intentional
overrun.  The changes in the PRs resulted in intermittent instead of
constant cleanup, so provenance ran smoother and more reliably even with
the current repository.


On Fri, Jan 27, 2017 at 9:58 AM, Mark Payne wrote:

> Hey Joe,
> Sorry - I don't think I saw this. I have actually been working on
> NIFI-3356 [1] for which
> I hope to have a PR up in the next few days. I've been doing some
> long-running tests,
> and I did find an issue yesterday so I've redeployed to some nodes to let
> it run over the
> weekend. If all looks good I can perhaps have a PR in on Monday.
> The Persistent Provenance Repository is quite old. At the time that it was
> written, the requirements
> were simply to store data in a sequential fashion and make it available
> for a Reporting Task to iterate
> over the events sequentially. There was no compression, and there was no
> indexing/searching. The
> requirements clearly have changed over the years :) So I started working
> on a totally new implementation
> and my testing shows that it is 2-3 times faster than the Persistent
> Provenance Repository while at the
> same time providing faster query capabilities and immediate access to
> events (as opposed to after a 30-
> second rollover period).
> When I get a chance to get it posted, it would be great if you want to put
> it through the wringer as well.
> I say all of this, because if you are interested, it may be worth holding
> off a few days and looking into
> implementing something similar to the new repo instead of focusing on the
> PersistentProvenanceRepository
> (or updating both).
> Thanks
> -Mark
> [1]
> On Jan 27, 2017, at 9:42 AM, Joe Skora wrote:
> I'm bumping this hoping for some feedback before I dive back into the
> ticket.
> Lacking any response for 30 days, I figure this either got overlooked due
> to year-end or no one has an opinion to add to the discussion (which seems
> unlikely).  ;-)
> On Tue, Dec 27, 2016 at 2:50 PM, Joe Skora wrote:
> All,
> Before the change to the schema-based repositories was committed, I was
> doing some testing for NIFI-1847 Improve Provenance Space Utilization
> based on these assumptions.
>   - A partition {{}} entry would
>   only be individually tracked if there was a corresponding
>   {{nifi.provenance.repository.directorySize.XYZ}} entry,
>   otherwise it will only be considered against the aggregate totals.
>   - The original {{}}
>   property would represent an aggregate across all partitions, whether
>   specifically tracked or not.
>   - Tracked partitions will be evaluated first and their sizes
>   accumulated to avoid double work.
> My testing showed improved use of space by partition, but also showed two
> problems.
>   - Calling the OS for the size of every journal, partition, and index
>   file is expensive so I'm looking at going to the OS every Nth pass and
>   tracking delta writes in between.
>   - Writers are chosen based on round robin, which is far from optimal
>   when the size and available space vary by partition.  I have some
>   thoughts but haven't put anything in code yet.
> Considering that provenance recording seems to be a bottleneck on some
> flows, this needs to be as fast as possible while staying 100%
> reliable.  So, any thoughts on these issues or wisdom relating to
> repositories and provenance are appreciated.
> Thanks,
> Joe
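
The idea in the quoted December 27 message of going to the OS only every
Nth pass and tracking delta writes in between could look something like
the sketch below.  This is only an illustration under assumptions: the
class name, the refresh_every parameter, and the record_write and
record_delete hooks are hypothetical and do not come from the NiFi code
base or from the ticket.

```python
import os


def directory_size(path):
    """Ask the OS for the total size of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file was rolled over or deleted during the scan
    return total


class PartitionSizeTracker:
    """Keep a running size estimate, re-measuring only every Nth pass."""

    def __init__(self, path, refresh_every=10, measure=directory_size):
        self.path = path
        self.refresh_every = refresh_every  # hypothetical tuning knob
        self.measure = measure              # injectable for testing
        self.passes_since_refresh = 0
        self.estimate = measure(path)       # one real measurement up front

    def record_write(self, nbytes):
        """Called by a writer after appending nbytes to this partition."""
        self.estimate += nbytes

    def record_delete(self, nbytes):
        """Called by cleanup after removing nbytes from this partition."""
        self.estimate = max(0, self.estimate - nbytes)

    def current_size(self):
        """Called once per maintenance pass; hits the OS every Nth call."""
        self.passes_since_refresh += 1
        if self.passes_since_refresh >= self.refresh_every:
            self.estimate = self.measure(self.path)
            self.passes_since_refresh = 0
        return self.estimate
```

The periodic re-measurement bounds how far the estimate can drift from
the real on-disk size, at the cost of one full scan every refresh_every
passes.  A real implementation would also need to be thread-safe, which
this sketch is not.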
