nifi-dev mailing list archives

From Mark Payne
Subject Re: Thoughts on NIFI-1847 Improve Provenance Space Utilization
Date Fri, 27 Jan 2017 14:58:25 GMT
Hey Joe,

Sorry - I don't think I saw this. I have actually been working on NIFI-3356 [1] for which
I hope to have a PR up in the next few days. I've been doing some long-running tests,
and I did find an issue yesterday so I've redeployed to some nodes to let it run over the
weekend. If all looks good I can perhaps have a PR in on Monday.

The Persistent Provenance Repository is quite old. At the time that it was written, the
requirements were simply to store data in a sequential fashion and make it available for a
Reporting Task to iterate over the events sequentially. There was no compression, and there
was no indexing/searching. The requirements clearly have changed over the years :) So I
started working on a totally new implementation, and my testing shows that it is 2-3 times
faster than the Persistent Provenance Repository while at the same time providing faster
query capabilities and immediate access to events (as opposed to after a 30-second rollover
period).

When I get a chance to get it posted, it would be great if you want to put it through the
wringer as well. I say all of this because, if you are interested, it may be worth holding
off a few days and looking into implementing something similar to the new repo instead of
focusing on the PersistentProvenanceRepository (or updating both).



On Jan 27, 2017, at 9:42 AM, Joe Skora wrote:

I'm bumping this hoping for some feedback before I dive back into the

Lacking any response for 30 days, I figure this either got overlooked due
to year-end or no one has an opinion to add to the discussion (which seems
unlikely).  ;-)

On Tue, Dec 27, 2016 at 2:50 PM, Joe Skora wrote:


Before the change to the schema-based repositories was committed, I was doing
some testing for NIFI-1847 Improve Provenance Space Utilization
based on these

  - A partition {{}} entry would be tracked individually only if there is a
  corresponding {{nifi.provenance.repository.directorySize.XYZ}} entry;
  otherwise it would only be counted against the aggregate totals.
  - The original {{}}
  property would represent an aggregate across all partitions, whether
  specifically tracked or not.
  - Tracked partitions will be evaluated first and their sizes
  accumulated to avoid double work.
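
For illustration, the accounting rules above could be sketched roughly as
follows. This is a minimal Java sketch under stated assumptions, not code from
NIFI-1847: the names PartitionLimits, track, evaluate, and Overage are
hypothetical, and partition sizes are passed in as a plain map instead of being
read from disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not NiFi code: per-partition limits plus an aggregate limit.
final class PartitionLimits {
    private final long aggregateLimitBytes;
    // Only partitions that have their own size entry are tracked individually.
    private final Map<String, Long> trackedLimits = new LinkedHashMap<>();

    PartitionLimits(long aggregateLimitBytes) {
        this.aggregateLimitBytes = aggregateLimitBytes;
    }

    void track(String partition, long limitBytes) {
        trackedLimits.put(partition, limitBytes);
    }

    // Result of one evaluation pass.
    static final class Overage {
        // Bytes over the limit for each tracked partition that exceeds its own limit.
        final Map<String, Long> perPartition = new LinkedHashMap<>();
        // Bytes over the aggregate limit across all partitions (0 if within it).
        long aggregate;
    }

    Overage evaluate(Map<String, Long> usedBytes) {
        Overage result = new Overage();
        long total = 0L;

        // Pass 1: tracked partitions. Their sizes are accumulated here so the
        // aggregate check below does not have to look at them again.
        for (Map.Entry<String, Long> limit : trackedLimits.entrySet()) {
            long used = usedBytes.getOrDefault(limit.getKey(), 0L);
            total += used;
            if (used > limit.getValue()) {
                result.perPartition.put(limit.getKey(), used - limit.getValue());
            }
        }

        // Pass 2: untracked partitions only count against the aggregate total.
        for (Map.Entry<String, Long> entry : usedBytes.entrySet()) {
            if (!trackedLimits.containsKey(entry.getKey())) {
                total += entry.getValue();
            }
        }

        result.aggregate = Math.max(0L, total - aggregateLimitBytes);
        return result;
    }
}
```

In this sketch a tracked partition can be over its own limit while the
aggregate is still within bounds, and the two overages are reported separately
so that a purge could target the offending partition first.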

My testing showed improved use of space by partition, but also showed two

  - Calling the OS for the size of every journal, partition, and index
  file is expensive, so I'm looking at going to the OS every Nth pass and
  tracking delta writes in between.
  - Writers are chosen based on round robin, which is far from optimal
  when the size and available space vary by partition.  I have some thoughts
  but haven't put anything in code yet.
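
The "every Nth pass" idea from the first point could look something like the
sketch below. It is only an illustration under stated assumptions: CachedSize,
recordWrite, and sizeForThisPass are hypothetical names, and the expensive OS
lookup is abstracted as a LongSupplier (in practice it might sum File.length()
over the files in a partition).

```java
import java.util.function.LongSupplier;

// Hypothetical sketch, not NiFi code: cache a partition's size, add write deltas
// between refreshes, and only ask the OS again on every Nth pass.
final class CachedSize {
    private final LongSupplier osSize;  // the expensive call to the OS
    private final int refreshEvery;     // N: passes between OS lookups
    private long estimate;
    private int passesSinceRefresh;

    CachedSize(LongSupplier osSize, int refreshEvery) {
        if (refreshEvery < 1) {
            throw new IllegalArgumentException("refreshEvery must be >= 1");
        }
        this.osSize = osSize;
        this.refreshEvery = refreshEvery;
        this.estimate = osSize.getAsLong();  // start from a real measurement
    }

    // Called by writers after each write so the estimate keeps up in between.
    synchronized void recordWrite(long bytes) {
        estimate += bytes;
    }

    // Called once per size-check pass; replaces the estimate with the real
    // size on every Nth call, which also corrects any accumulated drift.
    synchronized long sizeForThisPass() {
        passesSinceRefresh++;
        if (passesSinceRefresh >= refreshEvery) {
            estimate = osSize.getAsLong();
            passesSinceRefresh = 0;
        }
        return estimate;
    }
}
```

The trade-off is that the estimate can drift between refreshes (for example
when files are deleted or indexes grow without going through recordWrite), so N
would need to be small enough to keep the size limits meaningful.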

Considering that provenance recording seems to be a bottleneck on some
flows, this needs to be as fast as possible while staying 100%
reliable.  So, any thoughts on these issues or wisdom relating to
repositories and provenance would be appreciated.

