nifi-dev mailing list archives

From Daniel Cave <>
Subject MiNiFi C++ Data Provenance and Related Issues
Date Mon, 28 Nov 2016 14:25:50 GMT
This is a spin-off from the discussion on the MiNiFi C++ 0.1.0 Release
thread.  I assume a hub-and-spoke NiFi/MiNiFi C++ architecture.

As discussed on that thread, I am concerned about the current choice of
data provenance store and its implications, as well as the current data
provenance requirements for MiNiFi C++.  MiNiFi C++ must be highly
efficient and carry a minimal footprint in order to run as a background
process and on embedded devices.  Performance and space are therefore
priorities, as is the ability to communicate the needed information to the
NiFi hub (i.e. there is neither space for a large unindexed data
provenance archive locally nor the processing capacity to handle it).

The data provenance registry must be:  1) fault tolerant, 2) easy to
purge, 3) fast to write, 4) easily accessed in session, 5) easily accessed
post session.  The current choice (LevelDB) meets #3, but not the other
four requirements.  LevelDB is prone to corruption if the application
fails during a write (fails #1).  LevelDB has no secondary indexing, and if
keys are by UUID then there is no way to efficiently sort by date or by
parent/child (fails #2, #4, #5).  The choice of provenance store should
satisfy as many of these as possible.  For permanent stores, the choices
would be super lightweight databases or something fault resistant like
LMDB.  I don't have any preference, only that the choice functionally
addresses as many criteria as possible and absolutely satisfies #1.
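
To illustrate the sorting and purging concern (#2, #4, #5), here is a
minimal sketch of a time-ordered key layout.  It is not MiNiFi C++ code:
std::map stands in for any store that keeps keys sorted, and the names
makeProvenanceKey, recordEvent and purgeOlderThan are hypothetical.  The
assumption is that a key is an 8-byte big-endian timestamp followed by the
event UUID, so byte order matches time order and a purge by date becomes a
single range erase.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Hypothetical key layout: 8-byte big-endian timestamp, then the event UUID.
// Big-endian bytes make lexicographic key order equal to chronological order.
std::string makeProvenanceKey(std::uint64_t timestampMs, const std::string& uuid) {
    std::string key(8, '\0');
    for (int i = 7; i >= 0; --i) {
        key[i] = static_cast<char>(timestampMs & 0xFF);
        timestampMs >>= 8;
    }
    return key + uuid;
}

// std::map is only a stand-in for an ordered key-value store.
using ProvenanceStore = std::map<std::string, std::string>;

void recordEvent(ProvenanceStore& store, std::uint64_t timestampMs,
                 const std::string& uuid, const std::string& payload) {
    assert(!uuid.empty());  // every event needs an identifier
    store[makeProvenanceKey(timestampMs, uuid)] = payload;
}

// Erase every event strictly older than cutoffMs; returns how many were purged.
std::size_t purgeOlderThan(ProvenanceStore& store, std::uint64_t cutoffMs) {
    // A key holding only the cutoff timestamp sorts before any event at cutoffMs.
    auto end = store.lower_bound(makeProvenanceKey(cutoffMs, ""));
    std::size_t purged = static_cast<std::size_t>(std::distance(store.begin(), end));
    store.erase(store.begin(), end);
    return purged;
}
```

With keys by bare UUID the same purge would need a full scan, which is the
inefficiency described above.  Parent/child lookups would still need a
second key space or a real index.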

A solution to #4 and #5 could be for the entire provenance tree inside
MiNiFi C++ to ride with the flowfile and transfer to NiFi (including
through descendants).  I see this as something of a requirement as well,
since it is the only efficient way to provide cradle-to-grave provenance
through the entire MiNiFi/NiFi system without heavy post-processing to
reconstruct the tree.  While this adds slightly to the package sent
between MiNiFi and NiFi, the overhead is negligible compared with querying
for the provenance after the fact, especially where MiNiFi is embedded or
on an IoT device.
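
A rough sketch of what "provenance rides with the flowfile" could look
like follows.  Everything in it is an assumption made for illustration:
the ProvenanceEvent and FlowFile structs, the CREATE/FORK event names as
used here, and the pipe-delimited text encoding are hypothetical and are
not the actual MiNiFi C++ classes or the NiFi transfer format.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical provenance record; parentId is empty for a root event.
struct ProvenanceEvent {
    std::string eventId;
    std::string parentId;
    std::string type;
    std::uint64_t timestampMs;
};

// Hypothetical flow file that carries its own lineage alongside the content.
struct FlowFile {
    std::string uuid;
    std::string content;
    std::vector<ProvenanceEvent> lineage;
};

FlowFile createFlowFile(const std::string& uuid, const std::string& content,
                        std::uint64_t timestampMs) {
    FlowFile ff{uuid, content, {}};
    ff.lineage.push_back(ProvenanceEvent{uuid, "", "CREATE", timestampMs});
    return ff;
}

// A descendant copies the parent's lineage and appends its own event,
// so the full history travels with every child.
FlowFile deriveChild(const FlowFile& parent, const std::string& childUuid,
                     std::uint64_t timestampMs) {
    FlowFile child{childUuid, parent.content, parent.lineage};
    child.lineage.push_back(
        ProvenanceEvent{childUuid, parent.uuid, "FORK", timestampMs});
    return child;
}

// Illustrative encoding only: one "id|parent|type|timestamp" line per event,
// sent together with the flow file so the hub need not query the device later.
std::string serializeLineage(const FlowFile& ff) {
    std::string out;
    for (const ProvenanceEvent& e : ff.lineage) {
        out += e.eventId + "|" + e.parentId + "|" + e.type + "|" +
               std::to_string(e.timestampMs) + "\n";
    }
    return out;
}
```

The cost is a few dozen bytes per event on the wire, in exchange for the
receiving side never having to reconstruct the tree from a device-local
store.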

Any thoughts?
