nifi-commits mailing list archives

From marka...@apache.org
Subject svn commit: r1655856 [2/2] - /incubator/nifi/site/trunk/content/docs/nifi-docs/developer-guide.html
Date Thu, 29 Jan 2015 20:51:40 GMT

Modified: incubator/nifi/site/trunk/content/docs/nifi-docs/developer-guide.html
URL: http://svn.apache.org/viewvc/incubator/nifi/site/trunk/content/docs/nifi-docs/developer-guide.html?rev=1655856&r1=1655855&r2=1655856&view=diff
==============================================================================
--- incubator/nifi/site/trunk/content/docs/nifi-docs/developer-guide.html (original)
+++ incubator/nifi/site/trunk/content/docs/nifi-docs/developer-guide.html Thu Jan 29 20:51:40 2015
@@ -426,14 +426,18 @@ body.book #toc,body.book #preamble,body.
 .show-for-print{display:inherit!important}}
 </style>
 <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.2.0/css/font-awesome.min.css">
-	<script>
+	
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
   (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
   m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
   })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
 
   ga('create', 'UA-57264262-1', 'auto');
   ga('send', 'pageview');
+
 </script>
+	
 </head>
 <body class="article">
 <div id="header">
@@ -445,53 +449,2465 @@ body.book #toc,body.book #preamble,body.
 <div id="toc" class="toc">
 <div id="toctitle">Table of Contents</div>
 <ul class="sectlevel1">
-<li><a href="#the-designed-points-of-extension">The designed points of extension</a></li>
-<li><a href="#the-nifi-archive-nar-and-nifi-classloading">The NiFi Archive (NAR) and NiFi Classloading</a></li>
-<li><a href="#how-to-build-extensions">How to build extensions</a></li>
-<li><a href="#design-considerations">Design considerations</a></li>
-<li><a href="#consider-the-user-experience">Consider the User Experience</a></li>
-<li><a href="#how-to-contribute-to-apache-nifi">How to contribute to Apache NiFi</a></li>
+<li><a href="#introduction">Introduction</a></li>
+<li><a href="#components">NiFi Components</a></li>
+<li><a href="#processor_api">Processor API</a>
+<ul class="sectlevel2">
+<li><a href="#supporting_api">Supporting API</a></li>
+<li><a href="#AbstractProcessor">AbstractProcessor API</a></li>
+<li><a href="#component-lifecycle">Component Lifecycle</a></li>
+<li><a href="#reporting-processor-activity">Reporting Processor Activity</a></li>
+</ul>
+</li>
+<li><a href="#documenting-a-component">Documenting a Component</a>
+<ul class="sectlevel2">
+<li><a href="#documenting-properties">Documenting Properties</a></li>
+<li><a href="#documenting-relationships">Documenting Relationships</a></li>
+<li><a href="#documenting-capability-and-keywords">Documenting Capability and Keywords</a></li>
+<li><a href="#advanced-documentation">Advanced Documentation</a></li>
+</ul>
+</li>
+<li><a href="#common-processor-patterns">Common Processor Patterns</a>
+<ul class="sectlevel2">
+<li><a href="#ingress">Data Ingress</a></li>
+<li><a href="#data-egress">Data Egress</a></li>
+<li><a href="#route-based-on-content-one-to-one">Route Based on Content (One-to-One)</a></li>
+<li><a href="#route-based-on-content-one-to-many">Route Based on Content (One-to-Many)</a></li>
+<li><a href="#route-streams-based-on-content-one-to-many">Route Streams Based on Content (One-to-Many)</a></li>
+<li><a href="#route-based-on-attributes">Route Based on Attributes</a></li>
+<li><a href="#split-content-one-to-many">Split Content (One-to-Many)</a></li>
+<li><a href="#update-attributes-based-on-content">Update Attributes Based on Content</a></li>
+<li><a href="#enrich-modify-content">Enrich/Modify Content</a></li>
+</ul>
+</li>
+<li><a href="#error-handling">Error Handling</a>
+<ul class="sectlevel2">
+<li><a href="#exceptions-within-the-processor">Exceptions within the Processor</a></li>
+<li><a href="#exceptions-within-a-callback-ioexception-runtimeexception">Exceptions within a callback: IOException, RuntimeException</a></li>
+<li><a href="#penalization-vs-yielding">Penalization vs. Yielding</a></li>
+<li><a href="#session-rollback">Session Rollback</a></li>
+</ul>
+</li>
+<li><a href="#general-design-considerations">General Design Considerations</a>
+<ul class="sectlevel2">
+<li><a href="#consider-the-user">Consider the User</a></li>
+<li><a href="#cohesion-and-reusability">Cohesion and Reusability</a></li>
+<li><a href="#naming-convensions">Naming Conventions</a></li>
+<li><a href="#processor-behavior-annotations">Processor Behavior Annotations</a></li>
+<li><a href="#data-buffering">Data Buffering</a></li>
+</ul>
+</li>
+<li><a href="#controller-services">Controller Services</a>
+<ul class="sectlevel2">
+<li><a href="#developing-controller-service">Developing a ControllerService</a></li>
+<li><a href="#interacting-with-controller-service">Interacting with a ControllerService</a></li>
+</ul>
+</li>
+<li><a href="#reporting-tasks">Reporting Tasks</a>
+<ul class="sectlevel2">
+<li><a href="#developing-a-reporting-task">Developing a Reporting Task</a></li>
+</ul>
+</li>
+<li><a href="#testing">Testing</a>
+<ul class="sectlevel2">
+<li><a href="#instantiate-testrunner">Instantiate TestRunner</a></li>
+<li><a href="#add-controllerservices">Add ControllerServices</a></li>
+<li><a href="#set-property-values">Set Property Values</a></li>
+<li><a href="#enqueue-flowfiles">Enqueue FlowFiles</a></li>
+<li><a href="#run-the-processor">Run the Processor</a></li>
+<li><a href="#validate-output">Validate Output</a></li>
+<li><a href="#mocking-external-resources">Mocking External Resources</a></li>
+<li><a href="#additional-testing-capabilities">Additional Testing Capabilities</a></li>
+</ul>
+</li>
+<li><a href="#nars">NiFi Archives (NARs)</a></li>
+<li><a href="#how-to-contribute-to-apache-nifi">How to contribute to Apache NiFi</a>
+<ul class="sectlevel2">
+<li><a href="#technologies">Technologies</a></li>
+<li><a href="#where-to-start">Where to Start?</a></li>
+<li><a href="#supplying-a-contribution">Supplying a contribution</a></li>
+<li><a href="#contact-us">Contact Us</a></li>
+</ul>
+</li>
 </ul>
 </div>
 </div>
 <div id="content">
 <div class="sect1">
-<h2 id="the-designed-points-of-extension"><a class="anchor" href="#the-designed-points-of-extension"></a>The designed points of extension</h2>
+<h2 id="introduction"><a class="anchor" href="#introduction"></a>Introduction</h2>
+<div class="sectionbody">
+<div class="paragraph">
+<p>The intent of this Developer Guide is to provide the reader with the information needed to understand how Apache NiFi (incubating)
+extensions are developed and help to explain the thought process behind developing the components. It provides an introduction to
+and explanation of the API that is used to develop extensions. It does not, however, go into great detail about each
+of the methods in the API, as this guide is intended to supplement the JavaDocs of the API rather than replace them.
+This guide also assumes that the reader is familiar with Java 7 and Apache Maven.</p>
+</div>
+<div class="paragraph">
+<p>This guide is written by developers for developers. It is expected that before reading this
+guide, you have a basic understanding of NiFi and the concepts of dataflow. If not, please see the <a href="overview.html">NiFi Overview</a>
+and the <a href="user-guide.html">NiFi User Guide</a> to familiarize yourself with the concepts of NiFi.</p>
+</div>
+</div>
+</div>
+<div class="sect1">
+<h2 id="components"><a class="anchor" href="#components"></a>NiFi Components</h2>
+<div class="sectionbody">
+<div class="paragraph">
+<p>NiFi provides several extension points that give developers the
+ability to add functionality to the application to meet their needs. The following list provides a
+high-level description of the most common extension points:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>Processor</p>
+<div class="ulist">
+<ul>
+<li>
+<p>The Processor interface is the mechanism through which NiFi exposes access to
+<a href="#flowfile">FlowFile</a>s, their attributes, and their content. The Processor is the basic building
+block from which a NiFi dataflow is composed. This interface is used to accomplish
+all of the following tasks:</p>
+<div class="ulist">
+<ul>
+<li>
+<p>Create FlowFiles</p>
+</li>
+<li>
+<p>Read FlowFile content</p>
+</li>
+<li>
+<p>Write FlowFile content</p>
+</li>
+<li>
+<p>Read FlowFile attributes</p>
+</li>
+<li>
+<p>Update FlowFile attributes</p>
+</li>
+<li>
+<p>Ingest data</p>
+</li>
+<li>
+<p>Egress data</p>
+</li>
+<li>
+<p>Route data</p>
+</li>
+<li>
+<p>Extract data</p>
+</li>
+<li>
+<p>Modify data</p>
+</li>
+</ul>
+</div>
+</li>
+</ul>
+</div>
+</li>
+<li>
+<p>ReportingTask</p>
+<div class="ulist">
+<ul>
+<li>
+<p>The ReportingTask interface is a mechanism that NiFi exposes to allow metrics,
+monitoring information, and internal NiFi state to be published to external
+endpoints, such as log files, e-mail, and remote web services.</p>
+</li>
+</ul>
+</div>
+</li>
+<li>
+<p>ControllerService</p>
+<div class="ulist">
+<ul>
+<li>
+<p>A ControllerService provides shared state and functionality across Processors, other ControllerServices,
+and ReportingTasks within a single JVM. An example use case may include loading a very
+large dataset into memory. By performing this work in a ControllerService, the data
+can be loaded once and be exposed to all Processors via this service, rather than requiring
+many different Processors to load the dataset themselves.</p>
+</li>
+</ul>
+</div>
+</li>
+<li>
+<p>FlowFilePrioritizer</p>
+<div class="ulist">
+<ul>
+<li>
+<p>The FlowFilePrioritizer interface provides a mechanism by which <a href="#flowfile">FlowFile</a>s
+in a queue can be prioritized, or sorted, so that the FlowFiles can be processed in an order
+that is most effective for a particular use case.</p>
+</li>
+</ul>
+</div>
+</li>
+<li>
+<p>AuthorityProvider</p>
+<div class="ulist">
+<ul>
+<li>
+<p>An AuthorityProvider is responsible for determining which privileges and roles, if any,
+a given user should be granted.</p>
+</li>
+</ul>
+</div>
+</li>
+</ul>
+</div>
+</div>
+</div>
+<div class="sect1">
+<h2 id="processor_api"><a class="anchor" href="#processor_api"></a>Processor API</h2>
+<div class="sectionbody">
+<div class="paragraph">
+<p>The Processor is the most widely used Component available in NiFi.
+Processors are the only Components
+that are given access to create, remove, modify, or inspect
+FlowFiles (data and attributes).</p>
+</div>
+<div class="paragraph">
+<p>All Processors are loaded and instantiated using Java&#8217;s ServiceLoader
+mechanism. This means that all
+Processors must adhere to the following rules:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>The Processor must have a default constructor.</p>
+</li>
+<li>
+<p>The Processor&#8217;s JAR file must contain an entry in the META-INF/services directory named
+<code>org.apache.nifi.processor.Processor</code>. This is a text file where each line contains the
+fully-qualified class name of a Processor.</p>
+</li>
+</ul>
+</div>
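+<div class="paragraph">
+<p>As an illustration of the second rule, a JAR that bundles two Processors might carry a provider-configuration file
+like the sketch below. The two class names are hypothetical and stand in for your own Processor implementations.</p>
+</div>

```text
# File: META-INF/services/org.apache.nifi.processor.Processor
# One fully-qualified Processor class name per line (hypothetical names).
com.example.processors.FetchWidget
com.example.processors.PutWidget
```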
+<div class="paragraph">
+<p>While <code>Processor</code> is an interface that can be implemented directly, it
+will be extremely rare to do so, as
+the <code>org.apache.nifi.processor.AbstractProcessor</code> is the base class
+for almost all Processor implementations. The <code>AbstractProcessor</code> class provides a significant
+amount of functionality, which makes the task of developing a Processor much easier and more convenient.
+For the scope of this document, we will focus primarily on the <code>AbstractProcessor</code> class when dealing
+with the Processor API.</p>
+</div>
+<div class="paragraph">
+<div class="title">Concurrency Note</div>
+<p>NiFi is a highly concurrent framework. This means that all extensions
+must be thread-safe. If unfamiliar with writing concurrent software in Java, it is highly
+recommended that you familiarize yourself with the principles of Java concurrency.</p>
+</div>
+<div class="sect2">
+<h3 id="supporting_api"><a class="anchor" href="#supporting_api"></a>Supporting API</h3>
+<div class="paragraph">
+<p>In order to understand the Processor API, we must first understand -
+at least at a high level - several supporting classes and interfaces, which are discussed below.</p>
+</div>
+<div class="sect3">
+<h4 id="flowfile"><a class="anchor" href="#flowfile"></a>FlowFile</h4>
+<div class="paragraph">
+<p>A FlowFile is a logical notion that correlates a piece of data with a
+set of Attributes about that data.
+Such attributes include a FlowFile&#8217;s unique identifier, as well as its
+name, size, and any number of other
+flow-specific values. While the contents and attributes of a FlowFile
+can change, the FlowFile object is
+immutable. Modifications to a FlowFile are made possible by the ProcessSession.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="process_session"><a class="anchor" href="#process_session"></a>ProcessSession</h4>
+<div class="paragraph">
+<p>The ProcessSession, often referred to as simply a "session," provides
+a mechanism by which FlowFiles can be created, destroyed, examined, cloned, and transferred to other
+Processors. Additionally, a ProcessSession provides a mechanism for creating modified versions of
+FlowFiles, by adding or removing attributes, or by modifying the FlowFile&#8217;s content. The ProcessSession
+also exposes a mechanism for emitting provenance events that provide for the ability to track the
+lineage and history of a FlowFile. After operations are performed on one or more FlowFiles, a
+ProcessSession can be either committed or rolled back.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="process_context"><a class="anchor" href="#process_context"></a>ProcessContext</h4>
+<div class="paragraph">
+<p>The ProcessContext provides a bridge between a Processor and the framework. It provides information
+about how the Processor is currently configured and allows the Processor to perform
+Framework-specific tasks, such as yielding its resources so that the framework will schedule other
+Processors to run without consuming resources unnecessarily.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="property_descriptor"><a class="anchor" href="#property_descriptor"></a>PropertyDescriptor</h4>
+<div class="paragraph">
+<p>PropertyDescriptor defines a property that is to be used by a
+Processor, ReportingTask, or ControllerService.
+The definition of a property includes its name, a description of the
+property, an optional default value,
+validation logic, and an indicator as to whether or not the property
+is required in order for the Processor
+to be valid. PropertyDescriptors are created by instantiating an
+instance of the <code>PropertyDescriptor.Builder</code>
+class, calling the appropriate methods to fill in the details about
+the property, and finally calling
+the <code>build</code> method.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="validator"><a class="anchor" href="#validator"></a>Validator</h4>
+<div class="paragraph">
+<p>A PropertyDescriptor may specify one or more Validators that can be
+used to ensure that the user-entered value
+for a property is valid. If a Validator indicates that a property
+value is invalid, the Component will not be
+able to be run or used until the property becomes valid.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="validation_context"><a class="anchor" href="#validation_context"></a>ValidationContext</h4>
+<div class="paragraph">
+<p>When validating property values, a ValidationContext can be used to
+obtain ControllerServices,
+create PropertyValue objects, and compile and evaluate property values
+using the Expression Language.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="property_value"><a class="anchor" href="#property_value"></a>PropertyValue</h4>
+<div class="paragraph">
+<p>All property values returned to a Processor are returned in the form
+of a PropertyValue object. This
+object has convenience methods for converting the value from a String
+to other forms, such as numbers
+and time periods, as well as providing an API for evaluating the
+Expression Language.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="relationship"><a class="anchor" href="#relationship"></a>Relationship</h4>
+<div class="paragraph">
+<p>Relationships define the routes to which a FlowFile may be transferred
+from a Processor. Relationships
+are created by instantiating an instance of the <code>Relationship.Builder</code>
+class, calling the appropriate methods
+to fill in the details of the Relationship, and finally calling the
+<code>build</code> method.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="processor_initialization_context"><a class="anchor" href="#processor_initialization_context"></a>ProcessorInitializationContext</h4>
+<div class="paragraph">
+<p>After a Processor is created, its <code>initialize</code> method will be called
+with a <code>ProcessorInitializationContext</code> object.
+This object exposes configuration to the Processor that will not
+change throughout the life of the Processor,
+such as the unique identifier of the Processor.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="ProcessorLog"><a class="anchor" href="#ProcessorLog"></a>ProcessorLog</h4>
+<div class="paragraph">
+<p>Processors are encouraged to perform their logging via the
+<code>ProcessorLog</code> interface, rather than obtaining
+a direct instance of a third-party logger. This is because logging via
+the ProcessorLog allows the framework
+to render log messages that exceed a configurable severity level to
+the User Interface, allowing those who
+monitor the dataflow to be notified when important events occur.
+Additionally, it provides a consistent logging
+format for all Processors by logging stack traces when in DEBUG mode
+and providing the Processor&#8217;s unique
+identifier in log messages.</p>
+</div>
+</div>
+</div>
+<div class="sect2">
+<h3 id="AbstractProcessor"><a class="anchor" href="#AbstractProcessor"></a>AbstractProcessor API</h3>
+<div class="paragraph">
+<p>Since the vast majority of Processors will be created by extending the
+AbstractProcessor, it is the
+abstract class that we will examine in this section. The
+AbstractProcessor provides several methods that
+will be of interest to Processor developers.</p>
+</div>
+<div class="sect3">
+<h4 id="processor-initialization"><a class="anchor" href="#processor-initialization"></a>Processor Initialization</h4>
+<div class="paragraph">
+<p>When a Processor is created, before any other methods are invoked, the
+<code>init</code> method of the
+AbstractProcessor will be invoked. The method takes a single argument,
+which is of type
+<code>ProcessorInitializationContext</code>. The context object supplies the
+Processor with a ProcessorLog,
+the Processor&#8217;s unique identifier, and a ControllerServiceLookup that
+can be used to interact with the
+configured ControllerServices. Each of these objects is stored by the
+AbstractProcessor and may be obtained by
+subclasses via the <code>getLogger</code>, <code>getIdentifier</code>, and
+<code>getControllerServiceLookup</code> methods, respectively.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="exposing-processor-s-relationships"><a class="anchor" href="#exposing-processor-s-relationships"></a>Exposing Processor&#8217;s Relationships</h4>
+<div class="paragraph">
+<p>In order for a Processor to transfer a FlowFile to a new destination
+for follow-on processing, the
+Processor must first be able to expose to the Framework all of the
+Relationships that it currently supports.
+This allows users of the application to connect Processors to one
+another by creating
+Connections between Processors and assigning the appropriate
+Relationships to those Connections.</p>
+</div>
+<div class="paragraph">
+<p>A Processor exposes the valid set of Relationships by overriding the
+<code>getRelationships</code> method.
+This method takes no arguments and returns a <code>Set</code> of <code>Relationship</code>
+objects. For most Processors, this Set
+will be static, but other Processors will generate the Set
+dynamically, based on user configuration.
+For those Processors for which the Set is static, it is advisable to
+create an immutable Set in the Processor&#8217;s
+constructor or init method and return that value, rather than
+dynamically generating the Set. This
+pattern lends itself to cleaner code and better performance.</p>
+</div>
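+<div class="paragraph">
+<p>The static case described above can be sketched as follows. To keep the example compilable on its own, it uses a
+minimal stand-in class named <code>SketchRelationship</code> rather than the real <code>Relationship</code> class, and
+<code>StaticRelationshipsExample</code> is a hypothetical name, not part of the NiFi API. The point is only the
+pattern: build the Set once, make it unmodifiable, and return the same instance on every call.</p>
+</div>

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Minimal stand-in for a Relationship, only so the sketch compiles alone. */
final class SketchRelationship {
    private final String name;

    SketchRelationship(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

/** Hypothetical Processor-like class that exposes a fixed set of Relationships. */
class StaticRelationshipsExample {
    static final SketchRelationship REL_SUCCESS = new SketchRelationship("success");
    static final SketchRelationship REL_FAILURE = new SketchRelationship("failure");

    private final Set<SketchRelationship> relationships;

    StaticRelationshipsExample() {
        // Build the Set once and wrap it so that callers cannot modify it.
        Set<SketchRelationship> rels = new LinkedHashSet<>();
        rels.add(REL_SUCCESS);
        rels.add(REL_FAILURE);
        this.relationships = Collections.unmodifiableSet(rels);
    }

    /** Returns the same immutable Set on every call instead of rebuilding it. */
    Set<SketchRelationship> getRelationships() {
        return relationships;
    }
}
```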
+</div>
+<div class="sect3">
+<h4 id="exposing-processor-properties"><a class="anchor" href="#exposing-processor-properties"></a>Exposing Processor Properties</h4>
+<div class="paragraph">
+<p>Most Processors will require some amount of user configuration before
+they are able to be used. The properties
+that a Processor supports are exposed to the Framework via the
+<code>getSupportedPropertyDescriptors</code> method.
+This method takes no arguments and returns a <code>List</code> of
+<code>PropertyDescriptor</code> objects. The order of the objects in the
+List is important in that it dictates the order in which the
+properties will be rendered in the User Interface.</p>
+</div>
+<div class="paragraph">
+<p>A <code>PropertyDescriptor</code> object is constructed by creating a new
+instance of the <code>PropertyDescriptor.Builder</code> object,
+calling the appropriate methods on the builder, and finally calling
+the <code>build</code> method.</p>
+</div>
+<div class="paragraph">
+<p>While this method covers most of the use cases, it is sometimes
+desirable to allow users to configure
+additional properties whose names are not known in advance. This can be achieved
+by overriding the
+<code>getSupportedDynamicPropertyDescriptor</code> method. This method takes a
+<code>String</code> as its only argument, which
+indicates the name of the property. The method returns a
+<code>PropertyDescriptor</code> object that can be used to validate
+both the name of the property, as well as the value. Any
+PropertyDescriptor that is returned from this method
+should be built setting the value of <code>isDynamic</code> to true in the
+<code>PropertyDescriptor.Builder</code> class. The default
+behavior of AbstractProcessor is to not allow any dynamically created
+properties.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="validating-processor-properties"><a class="anchor" href="#validating-processor-properties"></a>Validating Processor Properties</h4>
+<div class="paragraph">
+<p>A Processor is not able to be started if its configuration is not
+valid. Validation of a Processor property can
+be achieved by setting a Validator on a PropertyDescriptor or by
+restricting the allowable values for a
+property via the PropertyDescriptor.Builder&#8217;s <code>allowableValues</code> method
+or <code>identifiesControllerService</code> method.</p>
+</div>
+<div class="paragraph">
+<p>There are times, though, when validating a Processor&#8217;s properties
+individually is not sufficient. For this purpose,
+the AbstractProcessor exposes a <code>customValidate</code> method. The method
+takes a single argument of type <code>ValidationContext</code>.
+The return value of this method is a <code>Collection</code> of
+<code>ValidationResult</code> objects that describe any problems that were
+found during validation. Only those ValidationResult objects whose
+<code>isValid</code> method returns <code>false</code> should be returned.
+This method will be invoked only if all properties are valid according
+to their associated Validators and Allowable Values.
+I.e., this method will be called only if all properties are valid
+in-and-of themselves, and this method allows for
+validation of a Processor&#8217;s configuration as a whole.</p>
+</div>
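+<div class="paragraph">
+<p>A rough sketch of whole-configuration validation follows. It is not written against the NiFi API: the method takes
+a plain <code>Map</code> instead of a <code>ValidationContext</code>, <code>SketchValidationResult</code> is a stand-in
+for <code>ValidationResult</code>, and the "Username" and "Password" property names are hypothetical. Each property is
+assumed to be valid on its own; the check reports a problem only when one is set without the other, and it returns
+only the invalid results.</p>
+</div>

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

/** Minimal stand-in for a validation result; not the real NiFi class. */
final class SketchValidationResult {
    final String subject;
    final String explanation;
    final boolean valid;

    SketchValidationResult(String subject, String explanation, boolean valid) {
        this.subject = subject;
        this.explanation = explanation;
        this.valid = valid;
    }
}

/** Hypothetical example of validating a configuration as a whole. */
class CrossPropertyValidation {

    /** Returns only the problems found; an empty Collection means "valid". */
    static Collection<SketchValidationResult> customValidate(Map<String, String> properties) {
        List<SketchValidationResult> problems = new ArrayList<>();

        boolean hasUsername = properties.get("Username") != null;
        boolean hasPassword = properties.get("Password") != null;

        // Either value is fine in isolation; only the combination can be wrong.
        if (hasUsername != hasPassword) {
            problems.add(new SketchValidationResult(
                    "Username / Password",
                    "Username and Password must be set together",
                    false));
        }
        return problems;
    }
}
```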
+</div>
+<div class="sect3">
+<h4 id="responding-to-changes-in-configuration"><a class="anchor" href="#responding-to-changes-in-configuration"></a>Responding to Changes in Configuration</h4>
+<div class="paragraph">
+<p>It is sometimes desirable to have a Processor eagerly react when its
+properties are changed. The <code>onPropertyModified</code>
+method allows a Processor to do just that. When a user changes the
+property values for a Processor, the
+<code>onPropertyModified</code> method will be called for each modified property.
+The method takes three arguments: the PropertyDescriptor that
+indicates which property was modified,
+the old value, and the new value. If the property had no previous
+value, the second argument will be <code>null</code>. If the property
+was removed, the third argument will be <code>null</code>. It is important to
+note that this method will be called regardless of whether
+or not the values are valid. This method will be called only when a
+value is actually modified, rather than being
+called when a user updates a Processor without changing its value. At
+the point that this method is invoked, it is guaranteed
+that the thread invoking this method is the only thread currently
+executing code in the Processor, unless the Processor itself
+creates its own threads.</p>
+</div>
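+<div class="paragraph">
+<p>One possible use of this hook is to keep a derived value, such as a compiled regular expression, in step with the
+property it comes from. The sketch below is self-contained and does not use NiFi types: the method takes a property
+name as a <code>String</code> where the real method takes a PropertyDescriptor, and <code>CachedPatternHolder</code>
+and the "Regular Expression" property are hypothetical. Because the method is called whether or not the new value is
+valid, the sketch tolerates a malformed expression, and it treats a <code>null</code> new value as removal.</p>
+</div>

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Hypothetical holder that recompiles a Pattern whenever its property changes. */
class CachedPatternHolder {
    /** Name of a hypothetical property whose value is a regular expression. */
    static final String REGEX_PROPERTY = "Regular Expression";

    // volatile so that threads doing the actual work see the latest Pattern.
    private volatile Pattern compiled;

    /** Same argument order as described above: which property, old value, new value. */
    void onPropertyModified(String propertyName, String oldValue, String newValue) {
        if (!REGEX_PROPERTY.equals(propertyName)) {
            return; // some other property changed
        }
        if (newValue == null) {
            compiled = null; // the property was removed
            return;
        }
        try {
            compiled = Pattern.compile(newValue);
        } catch (PatternSyntaxException e) {
            // The new value may be invalid; drop the stale Pattern rather than keep it.
            compiled = null;
        }
    }

    Pattern getCompiled() {
        return compiled;
    }
}
```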
+</div>
+<div class="sect3">
+<h4 id="performing-the-work"><a class="anchor" href="#performing-the-work"></a>Performing the Work</h4>
+<div class="paragraph">
+<p>When a Processor has work to do, it is scheduled to do so by having
+its <code>onTrigger</code> method called by the framework.
+The method takes two arguments: a <code>ProcessContext</code> and a
+<code>ProcessSession</code>. The first step in the <code>onTrigger</code> method
+is often to obtain a FlowFile on which the work is to be performed by
+calling one of the <code>get</code> methods on the ProcessSession.
+For Processors that ingest data into NiFi from external sources, this
+step is skipped. The Processor is then free to examine
+FlowFile attributes; add, remove, or modify attributes; read or modify
+FlowFile content; and transfer FlowFiles to the appropriate
+Relationships.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="when-processors-are-triggered"><a class="anchor" href="#when-processors-are-triggered"></a>When Processors are Triggered</h4>
+<div class="paragraph">
+<p>A Processor&#8217;s <code>onTrigger</code> method will be called only when it is
+scheduled to run and when work exists for the Processor.
+Work is said to exist for a Processor if any of the following conditions is met:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>A Connection whose destination is the Processor has at least one
+FlowFile in its queue</p>
+</li>
+<li>
+<p>The Processor has no incoming Connections</p>
+</li>
+<li>
+<p>The Processor is annotated with the @TriggerWhenEmpty annotation</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>Several factors exist that will contribute to when a Processor&#8217;s
+<code>onTrigger</code> method is invoked. First, the Processor will not
+be triggered unless a user has configured the Processor to run. If a
+Processor is scheduled to run, the Framework periodically
+(the period is configured by users in the User Interface) checks if
+there is work for the Processor to do, as described above.
+If so, the Framework will check downstream destinations of the
+Processor. If any of the Processor&#8217;s outbound Connections is full,
+by default, the Processor will not be scheduled to run.</p>
+</div>
+<div class="paragraph">
+<p>However, the <code>@TriggerWhenAnyDestinationAvailable</code> annotation may be
+added to the Processor&#8217;s class. In this case, the requirement
+is changed so that only one downstream destination must be "available"
+(a destination is considered "available" if the Connection&#8217;s
+queue is not full), rather than requiring that all downstream
+destinations be available.</p>
+</div>
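+<div class="paragraph">
+<p>The rules above can be summarized in a small model. This is only an illustration of the conditions as described
+here, not the framework&#8217;s scheduling code: <code>QueueState</code> and <code>TriggerRules</code> are hypothetical
+names, the two boolean flags stand in for the <code>@TriggerWhenEmpty</code> and
+<code>@TriggerWhenAnyDestinationAvailable</code> annotations, and the sketch assumes that a Processor with no outbound
+Connections is never blocked by its destinations.</p>
+</div>

```java
import java.util.List;

/** Hypothetical snapshot of one Connection's queue; not a NiFi class. */
class QueueState {
    final int queued;
    final boolean full;

    QueueState(int queued, boolean full) {
        this.queued = queued;
        this.full = full;
    }
}

/** Illustrative model of the trigger conditions described above. */
class TriggerRules {

    /** Work exists if any one of the three listed conditions is met. */
    static boolean workExists(List<QueueState> incoming, boolean triggerWhenEmpty) {
        if (triggerWhenEmpty || incoming.isEmpty()) {
            return true;
        }
        for (QueueState q : incoming) {
            if (q.queued > 0) {
                return true;
            }
        }
        return false;
    }

    /**
     * By default every outbound Connection must be available (not full).
     * With anyDestinationAvailable set, one available Connection is enough.
     * Assumption: no outbound Connections means nothing can block the Processor.
     */
    static boolean destinationsAvailable(List<QueueState> outgoing, boolean anyDestinationAvailable) {
        if (outgoing.isEmpty()) {
            return true;
        }
        int available = 0;
        for (QueueState q : outgoing) {
            if (!q.full) {
                available++;
            }
        }
        return anyDestinationAvailable ? available > 0 : available == outgoing.size();
    }

    static boolean shouldTrigger(List<QueueState> incoming, List<QueueState> outgoing,
                                 boolean triggerWhenEmpty, boolean anyDestinationAvailable) {
        return workExists(incoming, triggerWhenEmpty)
                && destinationsAvailable(outgoing, anyDestinationAvailable);
    }
}
```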
+<div class="paragraph">
+<p>Also related to Processor scheduling is the <code>@TriggerSerially</code>
+annotation. Processors that use this Annotation will never have more
+than one thread running the <code>onTrigger</code> method simultaneously. It is
+crucial to note, though, that the thread executing the code
+may change from invocation to invocation. Therefore, care must still
+be taken to ensure that the Processor is thread-safe!</p>
+</div>
+</div>
+</div>
+<div class="sect2">
+<h3 id="component-lifecycle"><a class="anchor" href="#component-lifecycle"></a>Component Lifecycle</h3>
+<div class="paragraph">
+<p>The NiFi API provides lifecycle support through use of Java
+Annotations. The <code>org.apache.nifi.annotation.lifecycle</code> package
+contains
+several annotations for lifecycle management. The following
+Annotations may be applied to Java methods in a NiFi component to
+indicate to
+the framework when the methods should be called. For the discussion of
+Component Lifecycle, we will define a NiFi component as a
+Processor, ControllerService, or ReportingTask.</p>
+</div>
+<div class="sect3">
+<h4 id="onadded"><a class="anchor" href="#onadded"></a>@OnAdded</h4>
+<div class="paragraph">
+<p>The <code>@OnAdded</code> annotation causes a method to be invoked as soon as a
+component is created. The
+component&#8217;s <code>initialize</code> method (or <code>init</code> method, if it subclasses
+<code>AbstractProcessor</code>) will be invoked after the component is
+constructed,
+followed by methods that are annotated with <code>@OnAdded</code>. If any method
+annotated with <code>@OnAdded</code> throws an Exception, an error will
+be returned to the user, and that component will not be added to the
+flow. Furthermore, other methods with this
+Annotation will not be invoked. This method will be called only once
+for the lifetime of a component.
+Methods with this Annotation must take zero arguments.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="onremoved"><a class="anchor" href="#onremoved"></a>@OnRemoved</h4>
+<div class="paragraph">
+<p>The <code>@OnRemoved</code> annotation causes a method to be invoked before a
+component is removed from the flow.
+This allows resources to be cleaned up before removing a component.
+Methods with this annotation must take zero arguments.
+If a method with this annotation throws an Exception, the component
+will still be removed.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="onscheduled"><a class="anchor" href="#onscheduled"></a>@OnScheduled</h4>
+<div class="paragraph">
+<p>This annotation indicates that a method should be called every time
+the component is scheduled to run. Because ControllerServices
+are not scheduled, using this annotation on a ControllerService does
+not make sense and will not be honored. It should be
+used only for Processors and Reporting Tasks. If any method with this
+annotation throws an Exception, other methods with this
+annotation will not be invoked, and a notification will be presented
+to the user. In this case, methods annotated with
+<code>@OnUnscheduled</code> are then triggered, followed by methods with the
+<code>@OnStopped</code> annotation (during this state, if any of these
+methods throws an Exception, those Exceptions are ignored). The
+component will then yield its execution for some period of time,
+referred to as the "Administrative Yield Duration," which is a value
+that is configured in the <code>nifi.properties</code> file. Finally, the
+process will start again, until all of the methods annotated with
+<code>@OnScheduled</code> have returned without throwing any Exception.
+Methods with this annotation may take zero arguments or may take a
+single argument. If the single argument variation is used,
+the argument must be of type <code>ProcessContext</code> if the component is a
+Processor or <code>ConfigurationContext</code> if the component
+is a ReportingTask.</p>
+</div>
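+<div class="paragraph">
+<p>The failure handling described above amounts to a retry loop, which the following sketch models. It is not
+framework code: <code>ScheduleRetrySketch</code> and <code>LifecycleMethod</code> are hypothetical, the three lists
+stand in for the methods carrying each annotation, <code>yieldMillis</code> stands in for the Administrative Yield
+Duration, and <code>maxAttempts</code> is an addition of this sketch so that the example always terminates.</p>
+</div>

```java
import java.util.List;

/** Illustrative model of the retry behaviour described above; not framework code. */
class ScheduleRetrySketch {

    /** Stand-in for an annotated lifecycle method that may throw. */
    interface LifecycleMethod {
        void invoke() throws Exception;
    }

    /**
     * Returns the attempt number on which every onScheduled method succeeded,
     * or -1 if maxAttempts was exhausted or the thread was interrupted.
     */
    static int schedule(List<LifecycleMethod> onScheduled,
                        List<LifecycleMethod> onUnscheduled,
                        List<LifecycleMethod> onStopped,
                        long yieldMillis,
                        int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (invokeUntilFailure(onScheduled)) {
                return attempt;
            }
            // Clean up; Exceptions thrown during this phase are ignored.
            invokeIgnoringFailures(onUnscheduled);
            invokeIgnoringFailures(onStopped);
            try {
                Thread.sleep(yieldMillis); // yield before starting again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }

    /** Stops at the first Exception, so later methods are not invoked. */
    private static boolean invokeUntilFailure(List<LifecycleMethod> methods) {
        for (LifecycleMethod m : methods) {
            try {
                m.invoke();
            } catch (Exception e) {
                return false;
            }
        }
        return true;
    }

    /** Invokes every method, swallowing any Exception it throws. */
    private static void invokeIgnoringFailures(List<LifecycleMethod> methods) {
        for (LifecycleMethod m : methods) {
            try {
                m.invoke();
            } catch (Exception e) {
                // ignored, as described for this state
            }
        }
    }
}
```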
+</div>
+<div class="sect3">
+<h4 id="onunscheduled"><a class="anchor" href="#onunscheduled"></a>@OnUnscheduled</h4>
+<div class="paragraph">
+<p>Methods with this annotation will be called whenever a Processor or
+ReportingTask is no longer scheduled to run. At that time, many threads
+may still be active in the Processor&#8217;s <code>onTrigger</code> method. If such a method
+throws an Exception, a log message will be generated, and the
+Exception will be otherwise
+ignored and other methods with this annotation will still be invoked.
+Methods with this annotation may take zero arguments or may take a
+single argument.
+If the single argument variation is used, the argument must be of type
+<code>ProcessContext</code> if the component is a Processor or
+<code>ConfigurationContext</code> if the
+component is a ReportingTask.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="onstopped"><a class="anchor" href="#onstopped"></a>@OnStopped</h4>
+<div class="paragraph">
+<p>Methods with this annotation will be called when a Processor or
+ReportingTask is no longer scheduled to run
+and all threads have returned from the <code>onTrigger</code> method. If such a
+method throws an Exception,
+a log message will be generated, and the Exception will otherwise be
+ignored; other methods with
+this annotation will still be invoked. Methods with this annotation
+must take zero arguments.</p>
+</div>
+</div>
+<div class="sect3">
+<h4 id="onshutdown"><a class="anchor" href="#onshutdown"></a>@OnShutdown</h4>
+<div class="paragraph">
+<p>Any method that is annotated with the <code>@OnShutdown</code> annotation will be
+called when NiFi is successfully
+shut down. If such a method throws an Exception, a log message will be
+generated, and the
+Exception will be otherwise ignored and other methods with this
+annotation will still be invoked.
+Methods with this annotation must take zero arguments. Note: while
+NiFi will attempt to invoke methods
+with this annotation on all components that use it, this is not always
+possible. For example, the process
+may be killed unexpectedly, in which case it does not have a chance to
+invoke these methods. Therefore,
+while methods using this annotation can be used to clean up resources,
+for instance, they should not be
+relied upon for critical business logic.</p>
+</div>
+</div>
+</div>
+<div class="sect2">
+<h3 id="reporting-processor-activity"><a class="anchor" href="#reporting-processor-activity"></a>Reporting Processor Activity</h3>
+<div class="paragraph">
+<p>Processors are responsible for reporting their activity so that users
+are able to understand what happens
+to their data. Processors should log events via the ProcessorLog,
+which is accessible via the InitializationContext
+or by calling the <code>getLogger</code> method of <code>AbstractProcessor</code>.</p>
+</div>
+<div class="paragraph">
+<p>Additionally, Processors should use the <code>ProvenanceReporter</code>
+interface, obtained via the ProcessSession&#8217;s
+<code>getProvenanceReporter</code> method. The ProvenanceReporter should be used
+to indicate any time that content is
+received from an external source or sent to an external location. The
+ProvenanceReporter also has methods for
+reporting when a FlowFile is cloned, forked, or modified, and when
+multiple FlowFiles are merged into a single FlowFile
+as well as for associating a FlowFile with some other identifier. However,
+these functions are less critical to report, as
+the framework is able to detect these things and emit appropriate
+events on the Processor&#8217;s behalf. Yet, it is a best practice
+for the Processor developer to emit these events, as it becomes
+explicit in the code that these events are being emitted, and
+the developer is able to provide additional details to the events,
+such as the amount of time that the action took or
+pertinent information about the action that was taken. If the
+Processor emits an event, the framework will not emit a duplicate
+event. Instead, it always assumes that the Processor developer knows
+what is happening in the context of the Processor
+better than the framework does. The framework may, however, emit a
+different event. For example, if a Processor modifies both the
+content of a FlowFile and its attributes and then emits only an
+ATTRIBUTES_MODIFIED event, the framework will emit a CONTENT_MODIFIED
+event. The framework will not emit an ATTRIBUTES_MODIFIED event if any
+other event is emitted for that FlowFile (either by the
+Processor or the framework). This is due to the fact that all
+Provenance Events know about the attributes of the FlowFile before the
+event occurred as well as those attributes that exist as a result
+of the processing of that FlowFile, and as a result the
+ATTRIBUTES_MODIFIED is generally considered redundant and would result
+in a rendering of the FlowFile lineage being very verbose.
+It is, however, acceptable for a Processor to emit this event along
+with others, if the event is considered pertinent from the
+perspective of the Processor.</p>
+</div>
+</div>
+</div>
+</div>
+<div class="sect1">
+<h2 id="documenting-a-component"><a class="anchor" href="#documenting-a-component"></a>Documenting a Component</h2>
+<div class="sectionbody">
+<div class="paragraph">
+<p>NiFi attempts to make the user experience as simple and convenient as
+possible by providing a significant amount of documentation
+to the user from within the NiFi application itself via the User
+Interface. In order for this to happen, of course, Processor
+developers must provide that documentation to the framework. NiFi
+exposes a few different mechanisms for supplying documentation to
+the framework.</p>
+</div>
+<div class="sect2">
+<h3 id="documenting-properties"><a class="anchor" href="#documenting-properties"></a>Documenting Properties</h3>
+<div class="paragraph">
+<p>Individual properties can be documented by calling the <code>description</code>
+method of a PropertyDescriptor&#8217;s builder as such:</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre class="highlight"><code class="language-java" data-lang="java">public static final PropertyDescriptor MY_PROPERTY = new PropertyDescriptor.Builder()
+  .name("My Property")
+  .description("Description of the Property")
+  ...
+  .build();</code></pre>
+</div>
+</div>
+<div class="paragraph">
+<p>If the property is to provide a set of allowable values, those values
+are presented to the user in a drop-down field in the UI.
+Each of those values can also be given a description:</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre class="highlight"><code class="language-java" data-lang="java">public static final AllowableValue EXTENSIVE = new AllowableValue("Extensive", "Extensive",
+	"Everything will be logged - use with caution!");
+public static final AllowableValue VERBOSE = new AllowableValue("Verbose", "Verbose",
+	"Quite a bit of logging will occur");
+public static final AllowableValue REGULAR = new AllowableValue("Regular", "Regular",
+	"Typical logging will occur");
+
+public static final PropertyDescriptor LOG_LEVEL = new PropertyDescriptor.Builder()
+  .name("Amount to Log")
+  .description("How much the Processor should log")
+  .allowableValues(REGULAR, VERBOSE, EXTENSIVE)
+  .defaultValue(REGULAR.getValue())
+  ...
+  .build();</code></pre>
+</div>
+</div>
+</div>
+<div class="sect2">
+<h3 id="documenting-relationships"><a class="anchor" href="#documenting-relationships"></a>Documenting Relationships</h3>
+<div class="paragraph">
+<p>Processor Relationships are documented in much the same way that
+properties are - by calling the <code>description</code> method of a
+Relationship&#8217;s builder:</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre class="highlight"><code class="language-java" data-lang="java">public static final Relationship MY_RELATIONSHIP = new Relationship.Builder()
+  .name("My Relationship")
+  .description("This relationship is used only if the Processor fails to process the data.")
+  .build();</code></pre>
+</div>
+</div>
+</div>
+<div class="sect2">
+<h3 id="documenting-capability-and-keywords"><a class="anchor" href="#documenting-capability-and-keywords"></a>Documenting Capability and Keywords</h3>
+<div class="paragraph">
+<p>The <code>org.apache.nifi.annotation.documentation</code> package provides Java
+annotations that can be used to document components. The
+CapabilityDescription
+annotation can be added to a Processor, Reporting Task, or Controller
+Service and is intended to provide a brief description of the
+functionality
+provided by the component. The Tags annotation has a <code>value</code> variable
+that is defined to be an Array of Strings. As such, it is used
+by providing multiple values as a comma-separated list of <code>String</code>s
+enclosed in curly braces. These values are then incorporated into the UI by
+allowing
+users to filter the components based on a tag (i.e., a keyword).
+Additionally, the UI provides a tag cloud that allows users to select
+the tags that
+they want to filter by. The tags that are largest in the cloud are
+those tags that exist the most on the components in that instance of
+NiFi. An
+example of using these annotations is provided below:</p>
+</div>
+<div class="listingblock">
+<div class="content">
+<pre class="highlight"><code class="language-java" data-lang="java">@Tags({"example", "documentation", "developer guide", "processor", "tags"})
+@CapabilityDescription("Example Processor that provides no real functionality but is provided" +
+	" for an example in the Developer Guide")
+public class ExampleProcessor extends AbstractProcessor {
+    ...
+}</code></pre>
+</div>
+</div>
+</div>
+<div class="sect2">
+<h3 id="advanced-documentation"><a class="anchor" href="#advanced-documentation"></a>Advanced Documentation</h3>
+<div class="paragraph">
+<p>When the documentation methods above are not sufficient, NiFi provides
+the ability to expose more advanced documentation to the user via the
+"Usage" documentation. When a user right-clicks on a Processor, NiFi
+provides a "Usage" menu item in the context menu. Additionally, the
+UI exposes a "Help" link in the top-right corner, from which the same
+Usage information can be found.</p>
+</div>
+<div class="paragraph">
+<p>The advanced documentation of a Processor is provided as an HTML file.
+This file should exist within a directory whose name is the
+fully-qualified
+name of the component, and this directory&#8217;s parent should be named
+<code>docs</code> and exist in the root of the Processor&#8217;s jar.
+The mechanism provided for this will be changing as of the 0.1.0
+release. At that time, this section will be updated to reflect
+the new procedures for providing this advanced documentation.</p>
+</div>
+</div>
+</div>
+</div>
+<div class="sect1">
+<h2 id="common-processor-patterns"><a class="anchor" href="#common-processor-patterns"></a>Common Processor Patterns</h2>
+<div class="sectionbody">
+<div class="paragraph">
+<p>While there are many different Processors available to NiFi users, the
+vast majority of them fall into
+one of several common design patterns. Below, we discuss these
+patterns, when the patterns are appropriate,
+reasons we follow these patterns, and things to watch out for when
+applying such patterns. Note that the patterns
+and recommendations discussed below are general guidelines and not
+hard-and-fast rules.</p>
+</div>
+<div class="sect2">
+<h3 id="ingress"><a class="anchor" href="#ingress"></a>Data Ingress</h3>
+<div class="paragraph">
+<p>A Processor that ingests data into NiFi has a single Relationship
+named <code>success</code>. This Processor generates
+new FlowFiles via the ProcessSession <code>create</code> method and does not pull
+FlowFiles from incoming Connections.
+The Processor name starts with "Get" or "Listen," depending on whether
+it polls an external source or exposes
+some interface to which external sources can connect. The name ends
+with the protocol used for communications.
+Processors that follow this pattern include <code>GetFile</code>, <code>GetSFTP</code>,
+<code>ListenHTTP</code>, and <code>GetHTTP</code>.</p>
+</div>
+<div class="paragraph">
+<p>This Processor may create or initialize a Connection Pool in a method
+that uses the <code>@OnScheduled</code> annotation.
+However, because communications problems may prevent connections from
+being established or cause connections
+to be terminated, connections themselves are not created at this
+point. Rather, the connections are
+created or leased from the pool in the <code>onTrigger</code> method.</p>
+</div>
+<div class="paragraph">
+<p>The <code>onTrigger</code> method of this Processor begins by leasing a
+connection from the Connection Pool, if possible,
+or otherwise creates a connection to the external service. When no
+data is available from the
+external source, the <code>yield</code> method of the ProcessContext is called by
+the Processor and the method returns so
+that this Processor avoids continually running and depleting resources
+without benefit. Otherwise, this
+Processor then creates a FlowFile via the ProcessSession&#8217;s <code>create</code>
+method and assigns an appropriate
+filename and path to the FlowFile (by adding the <code>filename</code> and <code>path</code>
+attributes), as well as any other
+attributes that may be appropriate. An OutputStream to the FlowFile&#8217;s content is
+obtained via the ProcessSession&#8217;s <code>write</code> method, passing a new
+OutputStreamCallback (which is usually
+an anonymous inner class). From within this callback, the Processor is
+able to write to the FlowFile and streams
+the content from the external resource to the FlowFile&#8217;s OutputStream.
+If the desire is to write the entire contents
+of an InputStream to the FlowFile, the <code>importFrom</code> method of
+ProcessSession may be more convenient to use than the
+<code>write</code> method.</p>
+</div>
+<div class="paragraph">
+<p>When this Processor expects to receive many small files, it may be
+advisable to create several FlowFiles from a
+single session before committing the session. Typically, this allows
+the Framework to treat the content of the
+newly created FlowFiles much more efficiently.</p>
+</div>
+<div class="paragraph">
+<p>This Processor generates a Provenance event indicating that it has
+received data and specifies from
+where the data came. This Processor should log the creation of the
+FlowFile so that the FlowFile&#8217;s
+origin can be determined by analyzing logs, if necessary.</p>
+</div>
+<div class="paragraph">
+<p>This Processor acknowledges receipt of the data and/or removes the
+data from the external source in order
+to prevent receipt of duplicate files. <strong>This is done only after the
+ProcessSession by which the FlowFile was
+created has been committed!</strong> Failure to adhere to this principle may
+result in data loss, as restarting NiFi
+before the session has been committed will result in the temporary
+file being deleted. Note, however, that it
+is possible using this approach to receive duplicate data because the
+application could be restarted after
+committing the session and before acknowledging or removing the data
+from the external source. In general, though,
+potential data duplication is preferred over potential data loss. The
+connection is finally returned or added to
+the Connection Pool, depending on whether the connection was leased
+from the Connection Pool to begin with or
+was created in the <code>onTrigger</code> method.</p>
+</div>
+<div class="paragraph">
+<p>If there is a communications problem, the connection is typically
+terminated and not returned (or added) to
+the Connection Pool. Connections to remote systems are torn down and
+the Connection Pool is shut down in a method
+annotated with the <code>@OnStopped</code> annotation so that resources can be reclaimed.</p>
+</div>
+</div>
+<div class="sect2">
+<h3 id="data-egress"><a class="anchor" href="#data-egress"></a>Data Egress</h3>
+<div class="paragraph">
+<p>A Processor that publishes data to an external source has two
+Relationships: <code>success</code> and <code>failure</code>. The
+Processor name starts with "Put" followed by the protocol that is used
+for data transmission. Processors
+that follow this pattern include <code>PutEmail</code>, <code>PutSFTP</code>, and
+<code>PostHTTP</code> (note that the name does not
+begin with "Put" because this would lead to confusion, since PUT and
+POST have special meanings when dealing with
+HTTP).</p>
+</div>
+<div class="paragraph">
+<p>This Processor may create or initialize a Connection Pool in a method
+that uses the <code>@OnScheduled</code> annotation.
+However, because communications problems may prevent connections from
+being established or cause connections
+to be terminated, connections themselves are not created at this
+point. Rather, the connections are
+created or leased from the pool in the <code>onTrigger</code> method.</p>
+</div>
+<div class="paragraph">
+<p>The <code>onTrigger</code> method first obtains a FlowFile from the
+ProcessSession via the <code>get</code> method. If no FlowFile is
+available, the method returns without obtaining a connection to the
+remote resource.</p>
+</div>
+<div class="paragraph">
+<p>If at least one FlowFile is available, the Processor obtains a
+connection from the Connection Pool, if possible,
+or otherwise creates a new connection. If the Processor is neither
+able to lease a connection from the Connection Pool
+nor create a new connection, the FlowFile is routed to <code>failure</code>, the
+event is logged, and the method returns.</p>
+</div>
+<div class="paragraph">
+<p>If a connection was obtained, the Processor obtains an InputStream to
+the FlowFile&#8217;s content by invoking the
+<code>read</code> method on the ProcessSession and passing an InputStreamCallback
+(which is often an anonymous inner class)
+and from within that callback transmits the contents of the FlowFile
+to the destination. The event is logged
+along with the amount of time taken to transfer the file and the data
+rate at which the file was transferred.
+A SEND event is reported to the ProvenanceReporter by obtaining the
+reporter from the ProcessSession via the
+<code>getProvenanceReporter</code> method and calling the <code>send</code> method on the
+reporter. The connection is returned or added
+to the Connection Pool, depending on whether the connection was leased
+from the pool or newly created by the
+<code>onTrigger</code> method.</p>
+</div>
+<div class="paragraph">
+<p>If there is a communications problem, the connection is typically
+terminated and not returned (or added) to
+the Connection Pool. If there is an issue sending the data to the
+remote resource, the desired approach for handling the
+error depends on a few considerations. If the issue is related to a
+network condition, the FlowFile is generally
+routed to <code>failure</code>. The FlowFile is not penalized because there is
+not necessarily a problem with the data. Unlike the
+case of the <a href="#ingress">Data Ingress</a> Processor, we typically do not call <code>yield</code> on
+the ProcessContext. This is because in the case of
+ingest, the FlowFile does not exist until the Processor is able to
+perform its function. However, in the case of a Put Processor,
+the DataFlow Manager may choose to route <code>failure</code> to a different
+Processor. This can allow for a "backup" system to be
+used in the case of problems with one system or can be used for load
+distribution across many systems.</p>
+</div>
+<div class="paragraph">
+<p>If a problem occurs that is data-related, one of two approaches should
+be taken. First, if the problem is likely to
+sort itself out, the FlowFile is penalized and then routed to
+<code>failure</code>. This is the case, for instance, with PutFTP,
+when a FlowFile cannot be transferred because of a file naming
+conflict. The presumption is that the file will eventually
+be removed from the directory so that the new file can be transferred.
+As a result, we penalize the FlowFile and route to
+<code>failure</code> so that we can try again later. In the other case, if there
+is an actual problem with the data (for example, the data does
+not conform to some required specification), a different approach may
+be taken. In this case, it may be advantageous
+to break apart the <code>failure</code> relationship into a <code>failure</code> and a
+<code>communications failure</code> relationship. This allows the
+DataFlow Manager to determine how to handle each of these cases
+individually. It is important in these situations to document
+well the differences between the two Relationships by clarifying it in
+the "description" when creating the Relationship.</p>
+</div>
+<div class="paragraph">
+<p>Connections to remote systems are torn down and the Connection Pool
+is shut down in a method
+annotated with <code>@OnStopped</code> so that resources can be reclaimed.</p>
+</div>
+</div>
+<div class="sect2">
+<h3 id="route-based-on-content-one-to-one"><a class="anchor" href="#route-based-on-content-one-to-one"></a>Route Based on Content (One-to-One)</h3>
+<div class="paragraph">
+<p>A Processor that routes data based on its content will take one of two
+forms: Route an incoming FlowFile to exactly
+one destination, or route incoming data to 0 or more destinations.
+Here, we will discuss the first case.</p>
+</div>
+<div class="paragraph">
+<p>This Processor has two relationships: <code>matched</code> and <code>unmatched</code>. If a
+particular data format is expected, the Processor
+will also have a <code>failure</code> relationship that is used when the input is
+not of the expected format. The Processor exposes
+a Property that indicates the routing criteria.</p>
+</div>
+<div class="paragraph">
+<p>If the Property that specifies routing criteria requires processing,
+such as compiling a Regular Expression, this processing
+is done in a method annotated with <code>@OnScheduled</code>, if possible. The
+result is then stored in a member variable that is marked
+as <code>volatile</code>.</p>
+</div>
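+<div class="paragraph">
+<p>The idea of preparing the routing criteria once and sharing the result
+across threads can be sketched as follows. This is a minimal illustration
+that uses only the JDK and does not touch the NiFi API: the class
+<code>PrecompiledRouteSketch</code> and its <code>onScheduled</code> and <code>route</code> methods are
+hypothetical stand-ins for a method annotated with <code>@OnScheduled</code> and for
+the decision made in <code>onTrigger</code>, and plain Strings stand in for Relationships.</p>
+</div>

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a one-to-one routing Processor (JDK only, no NiFi types).
class PrecompiledRouteSketch {

    // Written by the thread that schedules the component and read by the
    // threads that process data, so the field is marked volatile.
    private volatile Pattern compiledCriteria;

    // Plays the role of a method annotated with @OnScheduled: the Regular
    // Expression is compiled once, before any data is processed.
    void onScheduled(String routingCriteria) {
        compiledCriteria = Pattern.compile(routingCriteria);
    }

    // Plays the role of the routing decision made in onTrigger.
    String route(String content) {
        return compiledCriteria.matcher(content).find() ? "matched" : "unmatched";
    }

    public static void main(String[] args) {
        PrecompiledRouteSketch sketch = new PrecompiledRouteSketch();
        sketch.onScheduled("ERROR|WARN");
        System.out.println(sketch.route("2015-01-29 ERROR disk full")); // prints "matched"
        System.out.println(sketch.route("2015-01-29 INFO started"));    // prints "unmatched"
    }
}
```

+<div class="paragraph">
+<p>Compiling in the scheduling step rather than on every invocation keeps
+the per-FlowFile work down to the match itself.</p>
+</div>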
+<div class="paragraph">
+<p>The <code>onTrigger</code> method obtains a single FlowFile. The method reads the
+contents of the FlowFile via the ProcessSession&#8217;s <code>read</code>
+method, evaluating the Match Criteria as the data is streamed. The
+Processor then determines whether the FlowFile should be
+routed to <code>matched</code> or <code>unmatched</code> based on whether or not the
+criteria matched, and routes the FlowFile to the appropriate
+relationship.</p>
+</div>
+<div class="paragraph">
+<p>The Processor then emits a Provenance ROUTE event indicating the
+Relationship to which the Processor routed the FlowFile.</p>
+</div>
+<div class="paragraph">
+<p>This Processor is annotated with the <code>@SideEffectFree</code> and
+<code>@SupportsBatching</code> annotations from the <code>org.apache.nifi.annotation.behavior</code>
+package.</p>
+</div>
+</div>
+<div class="sect2">
+<h3 id="route-based-on-content-one-to-many"><a class="anchor" href="#route-based-on-content-one-to-many"></a>Route Based on Content (One-to-Many)</h3>
+<div class="paragraph">
+<p>If a Processor will route a single FlowFile to potentially many
+relationships, this Processor will be slightly different than
+the above-described Processor for Routing Data Based on Content. This
+Processor typically has Relationships that are dynamically
+defined by the user as well as an <code>unmatched</code> relationship.</p>
+</div>
+<div class="paragraph">
+<p>In order for the user to be able to define additional Properties,
+the <code>getSupportedDynamicPropertyDescriptor</code> method must be
+overridden. This method returns a PropertyDescriptor with the supplied
+name and an applicable Validator to ensure that the
+user-specified Matching Criteria is valid.</p>
+</div>
+<div class="paragraph">
+<p>In this Processor, the Set of Relationships that is returned by the
+<code>getRelationships</code> method is a member variable that is
+marked <code>volatile</code>. This Set is initially constructed with a single
+Relationship named <code>unmatched</code>. The <code>onPropertyModified</code> method
+is overridden so that when a Property is added or removed, a new
+Relationship is created with the same name. If the Processor has
+Properties that are not user-defined, it is important to check if the
+specified Property is user-defined. This can be achieved by
+calling the <code>isDynamic</code> method of the PropertyDescriptor that is
+passed to this method. If this Property is dynamic,
+a new Set of Relationships is then created, and the previous set of
+Relationships is copied into it. This new Set
+either has the newly created Relationship added to it or removed from
+it, depending on whether a new Property was added
+to the Processor or a Property was removed (Property removal is
+detected by checking whether the third argument to this method is <code>null</code>).
+The member variable holding the Set of Relationships is then updated
+to point to this new Set.</p>
+</div>
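+<div class="paragraph">
+<p>The copy-and-swap handling of the Relationship Set described above can
+be sketched with JDK types alone. The class below is an assumption made for
+illustration, not NiFi code: Relationship names are plain Strings, and the
+hypothetical <code>onPropertyModified</code> here takes the property name and a
+<code>dynamic</code> flag in place of the PropertyDescriptor (and its <code>isDynamic</code>
+method) that the framework passes.</p>
+</div>

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a Processor with user-defined Relationships (JDK only).
class DynamicRelationshipsSketch {

    // Readers always see a complete Set because the reference is swapped, never mutated.
    private volatile Set<String> relationships = Collections.singleton("unmatched");

    Set<String> getRelationships() {
        return relationships;
    }

    // Assumed simplification of the framework callback: propertyName and dynamic
    // replace the PropertyDescriptor argument; a null newValue means "removed".
    void onPropertyModified(String propertyName, boolean dynamic,
                            String oldValue, String newValue) {
        if (!dynamic) {
            return; // Properties that are not user-defined do not map to Relationships
        }
        Set<String> updated = new HashSet<>(relationships); // copy the previous Set
        if (newValue == null) {
            updated.remove(propertyName); // Property was removed
        } else {
            updated.add(propertyName);    // Property was added or changed
        }
        relationships = Collections.unmodifiableSet(updated); // point at the new Set
    }
}
```

+<div class="paragraph">
+<p>Because the member variable is only ever reassigned to a fully built Set,
+a thread calling <code>getRelationships</code> never observes a partially updated collection.</p>
+</div>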
+<div class="paragraph">
+<p>If the Properties that specify routing criteria require processing,
+such as compiling a Regular Expression, this processing is done
+in a method annotated with <code>@OnScheduled</code>, if possible. The result is
+then stored in a member variable that is marked as <code>volatile</code>.
+This member variable is generally of type <code>Map</code> where the key is of
+type <code>Relationship</code> and the value&#8217;s type is defined by the result of
+processing the property value.</p>
+</div>
+<div class="paragraph">
+<p>The <code>onTrigger</code> method obtains a FlowFile via the <code>get</code> method of
+ProcessSession. If no FlowFile is available, it returns immediately.
+Otherwise, a Set of type Relationship is created. The method reads the
+contents of the FlowFile via the ProcessSession&#8217;s <code>read</code> method,
+evaluating each of the Match Criteria as the data is streamed. For any
+criterion that matches, the Relationship associated with that
+criterion is added to the Set of Relationships.</p>
+</div>
+<div class="paragraph">
+<p>After reading the contents of the FlowFile, the method checks if the
+Set of Relationships is empty. If so, the original FlowFile has
+an attribute added to it to indicate the Relationship to which it was
+routed and is routed to the <code>unmatched</code> Relationship. This is logged, a
+Provenance ROUTE event is emitted, and the method returns. If the size
+of the Set is equal to 1, the original FlowFile has an attribute
+added to it to indicate the Relationship to which it was routed and
+is routed to the Relationship specified by the entry in the Set.
+This is logged, a Provenance ROUTE event is emitted for the FlowFile,
+and the method returns.</p>
+</div>
+<div class="paragraph">
+<p>In the event that the Set contains more than 1 Relationship, the
+Processor creates a clone of the FlowFile for each Relationship,
+except
+for the first. This is done via the <code>clone</code> method of the
+ProcessSession. There is no need to report a CLONE Provenance Event,
+as the
+framework will handle this for you. The original FlowFile and each
+clone are routed to their appropriate Relationship with an attribute
+indicating the name of the Relationship. A Provenance ROUTE event is
+emitted for each FlowFile. This is logged, and the method returns.</p>
+</div>
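+<div class="paragraph">
+<p>The decision logic of the last three paragraphs (no match, exactly one
+match, several matches) can be sketched without the NiFi API. In this
+hypothetical <code>OneToManyRouteSketch</code>, Relationship names are Strings, the
+content is a String rather than a stream, and <code>clonesNeeded</code> only counts
+the clones that a real Processor would create via the ProcessSession.</p>
+</div>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for the one-to-many routing decision (JDK only).
class OneToManyRouteSketch {

    // Returns the names of the Relationships the data should go to.
    // The first entry is for the original; every further entry needs a clone.
    static List<String> destinations(String content, Map<String, Pattern> criteria) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : criteria.entrySet()) {
            if (entry.getValue().matcher(content).find()) {
                matches.add(entry.getKey());
            }
        }
        if (matches.isEmpty()) {
            matches.add("unmatched"); // nothing matched: route the original to unmatched
        }
        return matches;
    }

    // One clone per destination except the first, which receives the original.
    static int clonesNeeded(List<String> destinations) {
        return destinations.size() - 1;
    }
}
```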
+<div class="paragraph">
+<p>This Processor is annotated with the <code>@SideEffectFree</code> and
+<code>@SupportsBatching</code> annotations from the
+<code>org.apache.nifi.annotation.behavior</code>
+package.</p>
+</div>
+</div>
+<div class="sect2">
+<h3 id="route-streams-based-on-content-one-to-many"><a class="anchor" href="#route-streams-based-on-content-one-to-many"></a>Route Streams Based on Content (One-to-Many)</h3>
+<div class="paragraph">
+<p>The previous description of Route Based on Content (One-to-Many)
+provides an abstraction
+for creating a very powerful Processor. However, it assumes that each
+FlowFile will be routed
+in its entirety to zero or more Relationships. What if the incoming
+data format is a "stream" of
+many different pieces of information - and we want to send different
+pieces of this stream to
+different Relationships? For example, imagine that we want to have a
+RouteCSV Processor such that
+it is configured with multiple Regular Expressions. If a line in the
+CSV file matches a Regular
+Expression, that line should be included in the outbound FlowFile to
+the associated relationship.
+If a Regular Expression is associated with the Relationship
+"has-apples" and that Regular Expression
+matches 1,000 of the lines in the FlowFile, there should be one outbound
+FlowFile for the "has-apples" relationship that has 1,000 lines in it.
+If a different Regular Expression
+is associated with the Relationship "has-oranges" and that Regular
+Expression matches 50 lines in the
+FlowFile, there should be one outbound FlowFile for the "has-oranges"
+relationship that has 50 lines in it.
+I.e., one FlowFile comes in and two FlowFiles come out. The two
+FlowFiles may contain some of the same lines
+of text from the original FlowFile, or they may be entirely different.
+This is the type of Processor that
+we will discuss in this section.</p>
+</div>
+<div class="paragraph">
+<p>This Processor&#8217;s name starts with "Route" and ends with the name of
+the data type that it routes. In our
+example here, we are routing CSV data, so the Processor is named
+RouteCSV. This Processor supports dynamic
+properties. Each user-defined property has a name that maps to the
+name of a Relationship. The value of
+the Property is in the format necessary for the "Match Criteria." In
+our example, the value of the property
+must be a valid Regular Expression.</p>
+</div>
+<div class="paragraph">
+<p>This Processor maintains an internal <code>ConcurrentMap</code> where the key is
+a <code>Relationship</code> and the value is of
+a type dependent on the format of the Match Criteria. In our example,
+we would maintain a
+<code>ConcurrentMap&lt;Relationship, Pattern&gt;</code>. This Processor overrides the
+<code>onPropertyModified</code> method.
+If the new value supplied to this method (the third argument) is null,
+the Relationship whose name is
+defined by the property name (the first argument) is removed from the
+ConcurrentMap. Otherwise, the new value
+is processed (in our example, by calling <code>Pattern.compile(newValue)</code>)
+and this value is added to the ConcurrentMap
+with the key again being the Relationship whose name is specified by
+the property name.</p>
+</div>
+<div class="paragraph">
+<p>This Processor will override the <code>customValidate</code> method. In this
+method, it will retrieve all Properties from
+the <code>ValidationContext</code> and count the number of PropertyDescriptors
+that are dynamic (by calling <code>isDynamic()</code>
+on the PropertyDescriptor). If the number of dynamic
+PropertyDescriptors is 0, this indicates that the user
+has not added any Relationships, so the Processor returns a
+<code>ValidationResult</code> indicating that the Processor
+is not valid because it has no Relationships added.</p>
+</div>
+<div class="paragraph">
+<p>The Processor returns all of the Relationships specified by the user
+when its <code>getRelationships</code> method is
+called and will also return an <code>unmatched</code> Relationship. Because this
+Processor will have to read and write to the
+Content Repository (which can be relatively expensive), if this
+Processor is expected to be used for very high
+data volumes, it may be advantageous to add a Property that allows the
+user to specify whether or not they care
+about the data that does not match any of the Match Criteria.</p>
+</div>
+<div class="paragraph">
+<p>When the <code>onTrigger</code> method is called, the Processor obtains a
+FlowFile via <code>ProcessSession.get</code>. If no data
+is available, the Processor returns. Otherwise, the Processor creates
+a <code>Map&lt;Relationship, FlowFile&gt;</code>. We will
+refer to this Map as <code>flowFileMap</code>. The Processor reads the incoming
+FlowFile by calling <code>ProcessSession.read</code>
+and provides an <code>InputStreamCallback</code>.
+From within the Callback, the Processor reads the first piece of data
+from the FlowFile. The Processor then
+evaluates each of the Match Criteria against this piece of data. If a
+particular criterion (in our example,
+a Regular Expression) matches, the Processor obtains the FlowFile from
+<code>flowFileMap</code> that belongs to the appropriate
+Relationship. If no FlowFile yet exists in the Map for this
+Relationship, the Processor creates a new FlowFile
+by calling <code>session.create(incomingFlowFile)</code> and then adds the new
+FlowFile to <code>flowFileMap</code>. The Processor then
+writes this piece of data to the FlowFile by calling <code>session.append</code>
+with an <code>OutputStreamCallback</code>. From within
+this OutputStreamCallback, we have access to the new FlowFile&#8217;s
+OutputStream, so we are able to write the data
+to the new FlowFile. We then return from the OutputStreamCallback.
+After iterating over each of the Match Criteria,
+if none of them match, we perform the same routines as above for the
+<code>unmatched</code> relationship (unless the user
+configures us to not write out unmatched data). Now that we have
+called <code>session.append</code>, we have a new version of
+the FlowFile. As a result, we need to update our <code>flowFileMap</code> to
+associate the Relationship with the new FlowFile.</p>
+</div>
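+<div class="paragraph">
+<p>The routing loop above can be modeled without the framework, as in the sketch below. It is
+an illustration under stated assumptions: <code>LineRouter</code> is a hypothetical class, each "piece of data"
+is assumed to be one line of text, and a list of lines per relationship name stands in for the
+FlowFile that <code>session.append</code> would grow.</p>
+</div>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical model of the One-to-Many routing loop; not part of the NiFi API.
class LineRouter {
    // Returns, for each relationship name, the lines that would be appended to its FlowFile.
    static Map<String, List<String>> route(List<String> lines, Map<String, Pattern> criteria) {
        Map<String, List<String>> routed = new LinkedHashMap<>();
        for (String line : lines) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> entry : criteria.entrySet()) {
                if (entry.getValue().matcher(line).find()) {
                    // Create the per-relationship container lazily, as the text does for FlowFiles.
                    routed.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(line);
                    matched = true;
                }
            }
            if (!matched) {
                // Nothing matched, so the line goes to the unmatched relationship.
                routed.computeIfAbsent("unmatched", k -> new ArrayList<>()).add(line);
            }
        }
        return routed;
    }
}
```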
+<div class="paragraph">
+<p>If at any point, an Exception is thrown, we will need to route the
+incoming FlowFile to <code>failure</code>. We will also
+need to remove each of the newly created FlowFiles, as we won&#8217;t be
+transferring them anywhere. We can accomplish
+this by calling <code>session.remove(flowFileMap.values())</code>. At this point,
+we will log the error and return.</p>
+</div>
+<div class="paragraph">
+<p>Otherwise, if all is successful, we can now iterate through the
+<code>flowFileMap</code> and transfer each FlowFile to the
+corresponding Relationship. The original FlowFile is then either
+removed or routed to an <code>original</code> relationship.
+For each of the newly created FlowFiles, we also emit a Provenance
+ROUTE event indicating which Relationship
+the FlowFile went to. It is also helpful to include in the details of
+the ROUTE event how many pieces of information
+were included in this FlowFile. This allows DataFlow Managers to
+easily see when looking at the Provenance
+Lineage view how many pieces of information went to each of the
+relationships for a given input FlowFile.</p>
+</div>
+<div class="paragraph">
+<p>Additionally, some Processors may need to "group" the data that is
+sent to each Relationship so that each FlowFile
+that is sent to a relationship has the same value. In our example, we
+may want to allow the Regular Expression
+to have a Capturing Group and if two different lines in the CSV match
+the Regular Expression but have different
+values for the Capturing Group, we want them to be added to two
+different FlowFiles. The matching value could then
+be added to each FlowFile as an Attribute. This can be accomplished by
+modifying the <code>flowFileMap</code> such that
+it is defined as <code>Map&lt;Relationship, Map&lt;T, FlowFile&gt;&gt;</code> where <code>T</code> is
+the type of the Grouping Function (in our
+example, the Group would be a <code>String</code> because it is the result of
+evaluating a Regular Expression&#8217;s
+Capturing Group).</p>
+</div>
+</div>
+<div class="sect2">
+<h3 id="route-based-on-attributes"><a class="anchor" href="#route-based-on-attributes"></a>Route Based on Attributes</h3>
+<div class="paragraph">
+<p>This Processor is almost identical to the Route Data Based on Content
+Processors described above. It takes two different forms: One-to-One
+and
+One-to-Many, as do the Content-Based Routing Processors. This
+Processor, however, does not make any call to ProcessSession&#8217;s <code>read</code>
+method,
+as it does not read FlowFile content. This Processor is typically very
+fast, so the <code>@SupportsBatching</code> annotation can be very important
+in this case.</p>
+</div>
+</div>
+<div class="sect2">
+<h3 id="split-content-one-to-many"><a class="anchor" href="#split-content-one-to-many"></a>Split Content (One-to-Many)</h3>
+<div class="paragraph">
+<p>This Processor generally requires no user configuration, with the
+exception of the size of each Split to create. The <code>onTrigger</code> method
+obtains
+a FlowFile from its input queues. A List of type FlowFile is created.
+The original FlowFile is read via the ProcessSession&#8217;s <code>read</code> method,
+and an InputStreamCallback is used. Within the InputStreamCallback,
+the content is read until a point is reached at which the FlowFile
+should be
+split. If no split is needed, the Callback returns, and the original
+FlowFile is routed to <code>success</code>. In this case, a Provenance ROUTE
+event
+is emitted. Typically, ROUTE events are not emitted when routing a
+FlowFile to <code>success</code> because this generates a very verbose lineage
+that
+becomes difficult to navigate. However, in this case, the event is
+useful because we would otherwise expect a FORK event and the absence
+of
+any event is likely to cause confusion. The fact that the FlowFile was
+not split but was instead transferred to <code>success</code> is logged, and the
+method returns.</p>
+</div>
+<div class="paragraph">
+<p>If a point is reached at which a FlowFile needs to be split, a new
+FlowFile is created via the ProcessSession&#8217;s <code>create(FlowFile)</code> method
+or the
+<code>clone(FlowFile, long, long)</code> method. The next section of code depends
+on whether the <code>create</code> method is used or the <code>clone</code> method is used.
+Both methods are described below. Which solution is appropriate must
+be determined on a case-by-case basis.</p>
+</div>
+<div class="paragraph">
+<p>The Create Method is most appropriate when the data will not be
+directly copied from the original FlowFile to the new FlowFile.
+For example, if only some of the data will be copied, or if the data
+will be modified in some way before being copied to the new
+FlowFile, this method is necessary. However, if the content of the new
+FlowFile will be an exact copy of a portion of the original
+FlowFile, the Clone Method is much preferred.</p>
+</div>
+<div class="paragraph">
+<p><strong>Create Method</strong>
+If using the <code>create</code> method, the method is called with the original
+FlowFile as the argument so that the newly created FlowFile will
+inherit
+the attributes of the original FlowFile and a Provenance FORK event
+will be created by the framework.</p>
+</div>
+<div class="paragraph">
+<p>The code then enters a <code>try/finally</code> block. Within the <code>finally</code>
+block, the newly created FlowFile is added to the List of FlowFiles
+that have
+been created. This is done within a <code>finally</code> block so that if an
+Exception is thrown, the newly created FlowFile will be appropriately
+cleaned up.
+Within the <code>try</code> block, the callback initiates a new callback by
+calling the ProcessSession&#8217;s <code>write</code> method with an
+OutputStreamCallback.
+The appropriate data is then copied from the InputStream of the
+original FlowFile to the OutputStream for the new FlowFile.</p>
+</div>
+<div class="paragraph">
+<p><strong>Clone Method</strong>
+If the content of the newly created FlowFile is to be only a
+contiguous subset of the bytes of the original FlowFile, it is
+preferred
+to use the <code>clone(FlowFile, long, long)</code> method instead of the
+<code>create(FlowFile)</code> method of the ProcessSession. In this case, the
+offset
+of the original FlowFile at which the new FlowFile&#8217;s content should
+begin is passed as the second argument to the <code>clone</code> method. The
+length
+of the new FlowFile is passed as the third argument to the <code>clone</code>
+method. For example, if the original FlowFile was 10,000 bytes
+and we called <code>clone(flowFile, 500, 100)</code>, the FlowFile that would be
+returned to us would be identical to <code>flowFile</code> with respect to its
+attributes. However, the content of the newly created FlowFile would
+be 100 bytes in length and would start at offset 500 of the original
+FlowFile. That is, the contents of the newly created FlowFile would be
+the same as if you had copied bytes 500 through 599 of the original
+FlowFile.</p>
+</div>
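+<div class="paragraph">
+<p>The offset and length arithmetic in this example can be checked with a small sketch. The
+<code>CloneRange</code> helper is hypothetical and models only which bytes the clone would expose. It
+copies the bytes purely to make the range visible, whereas the framework is described as referencing
+the original content instead of copying it.</p>
+</div>

```java
import java.util.Arrays;

// Hypothetical helper that models only the content range of clone(FlowFile, long, long).
class CloneRange {
    // Returns the bytes that a clone with the given offset and length would expose.
    static byte[] slice(byte[] original, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > original.length) {
            throw new IllegalArgumentException("range exceeds original content");
        }
        // copyOfRange takes an exclusive end index, so offset 500 with length 100
        // covers bytes 500 through 599.
        return Arrays.copyOfRange(original, offset, offset + length);
    }
}
```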
+<div class="paragraph">
+<p>After the clone has been created, it is added to the List of FlowFiles.</p>
+</div>
+<div class="paragraph">
+<p>This method is strongly preferred over the Create Method, when
+applicable,
+because no disk I/O is required. The framework is able to simply
+create a new FlowFile
+that references a subset of the original FlowFile&#8217;s content, rather
+than actually copying
+the data. However, this is not always possible. For example, if header
+information must be copied
+from the beginning of the original FlowFile and added to the beginning
+of each Split,
+then this method is not possible.</p>
+</div>
+<div class="paragraph">
+<p><strong>Both Methods</strong>
+Regardless of whether the Clone Method or the Create Method is used,
+the following is applicable.</p>
+</div>
+<div class="paragraph">
+<p>If at any point in the InputStreamCallback, a condition is reached in
+which processing cannot continue
+(for example, the input is malformed), a <code>ProcessException</code> should be
+thrown. The call to the
+ProcessSession&#8217;s <code>read</code> method is wrapped in a <code>try/catch</code> block
+where <code>ProcessException</code> is
+caught. If an Exception is caught, a log message is generated
+explaining the error. The List of
+newly created FlowFiles is removed via the ProcessSession&#8217;s <code>remove</code>
+method. The original FlowFile
+is routed to <code>failure</code>.</p>
+</div>
+<div class="paragraph">
+<p>If no problems arise, the original FlowFile is routed to <code>original</code>
+and all newly created FlowFiles
+are updated to include the following attributes:</p>
+</div>
+<table class="tableblock frame-all grid-all spread">
+<colgroup>
+<col style="width: 50%;">
+<col style="width: 50%;">
+</colgroup>
+<thead>
+<tr>
+<th class="tableblock halign-left valign-top">Attribute Name</th>
+<th class="tableblock halign-left valign-top">Description</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td class="tableblock halign-left valign-top"><p class="tableblock"><code>split.parent.uuid</code></p></td>
+<td class="tableblock halign-left valign-top"><p class="tableblock">The UUID of the original FlowFile</p></td>
+</tr>
+<tr>
+<td class="tableblock halign-left valign-top"><p class="tableblock"><code>split.index</code></p></td>
+<td class="tableblock halign-left valign-top"><p class="tableblock">A one-up number indicating which FlowFile in the list this is (the first FlowFile
+				  created will have a value <code>0</code>, the second will have a value <code>1</code>, etc.)</p></td>
+</tr>
+<tr>
+<td class="tableblock halign-left valign-top"><p class="tableblock"><code>split.count</code></p></td>
+<td class="tableblock halign-left valign-top"><p class="tableblock">The total number of split FlowFiles that were created</p></td>
+</tr>
+</tbody>
+</table>
+<div class="paragraph">
+<p>The newly created FlowFiles are routed to <code>success</code>; this event is
+logged; and the method returns.</p>
+</div>
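+<div class="paragraph">
+<p>As a rough sketch of how the attribute values in the table above could be computed, consider
+the helper below. <code>SplitAttributes</code> is a hypothetical class that only builds the maps. In
+a real Processor, the assumption is that each map would then be applied to its FlowFile through the
+ProcessSession.</p>
+</div>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; it only computes the attribute values listed in the table.
class SplitAttributes {
    // Builds one attribute map per split, in the order the splits were created.
    static List<Map<String, String>> forSplits(String parentUuid, int splitCount) {
        List<Map<String, String>> result = new ArrayList<>();
        for (int index = 0; index < splitCount; index++) {
            Map<String, String> attributes = new LinkedHashMap<>();
            attributes.put("split.parent.uuid", parentUuid);
            attributes.put("split.index", String.valueOf(index)); // the first split has index 0
            attributes.put("split.count", String.valueOf(splitCount));
            result.add(attributes);
        }
        return result;
    }
}
```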
+</div>
+<div class="sect2">
+<h3 id="update-attributes-based-on-content"><a class="anchor" href="#update-attributes-based-on-content"></a>Update Attributes Based on Content</h3>
+<div class="paragraph">
+<p>This Processor is very similar to the Route Based on Content
+Processors discussed above. Rather than
+routing a FlowFile to <code>matched</code> or <code>unmatched</code>, the FlowFile is
+generally routed to <code>success</code> or <code>failure</code>
+and attributes are added to the FlowFile as appropriate. The
+attributes to be added are configured in a
+manner similar to that of the Route Based on Content (One-to-Many),
+with the user defining their own
+properties. The name of the property indicates the name of an
+attribute to add. The value of the
+property indicates some Matching Criteria to be applied to the data.
+If the Matching Criteria matches
+the data, an attribute is added with the name the same as that of the
+Property. The value of the
+attribute is the portion of the content that matched the criteria.</p>
+</div>
+<div class="paragraph">
+<p>For example, a Processor that evaluates XPath Expressions may allow
+user-defined XPaths to be
+entered. If the XPath matches the content of a FlowFile, that FlowFile
+will have an attribute added with
+the name being equal to that of the Property name and a value equal to
+the textual content of the XML Element or
+Attribute that matched the XPath. The <code>failure</code> relationship would
+then be used if the incoming FlowFile
+was not valid XML in this example. The <code>success</code> relationship would be
+used regardless of whether or not
+any matches were found. This can then be used to route the FlowFile
+when appropriate.</p>
+</div>
+<div class="paragraph">
+<p>This Processor emits a Provenance Event of type ATTRIBUTES_MODIFIED.</p>
+</div>
+</div>
+<div class="sect2">
+<h3 id="enrich-modify-content"><a class="anchor" href="#enrich-modify-content"></a>Enrich/Modify Content</h3>
+<div class="paragraph">
+<p>The Enrich/Modify Content pattern is very common and very generic.
+This pattern is responsible for any
+general content modification. For the majority of cases, this
+Processor is marked with the
+<code>@SideEffectFree</code> and <code>@SupportsBatching</code> annotations. The Processor
+has any number of required and optional
+Properties, depending on the Processor&#8217;s function. The Processor
+generally has a <code>success</code> and <code>failure</code> relationship.
+The <code>failure</code> relationship is generally used when the input file is
+not in the expected format.</p>
+</div>
+<div class="paragraph">
+<p>This Processor obtains a FlowFile and updates it using the
+ProcessSession&#8217;s <code>write(StreamCallback)</code> method
+so that it is able to both read from the FlowFile&#8217;s content and write
+to the next version of the FlowFile&#8217;s
+content. If errors are encountered during the callback, the callback
+will throw a <code>ProcessException</code>. The
+call to the ProcessSession&#8217;s <code>write</code> method is wrapped in a
+<code>try/catch</code> block that catches <code>ProcessException</code>
+and routes the FlowFile to failure.</p>
+</div>
+<div class="paragraph">
+<p>If the callback succeeds, a CONTENT_MODIFIED Provenance Event is emitted.</p>
+</div>
+</div>
+</div>
+</div>
+<div class="sect1">
+<h2 id="error-handling"><a class="anchor" href="#error-handling"></a>Error Handling</h2>
+<div class="sectionbody">
+<div class="paragraph">
+<p>When writing a Processor, there are several different unexpected cases that can occur.
+It is important that Processor developers understand the mechanics of how the NiFi framework
+behaves if Processors do not handle errors themselves, and it&#8217;s important to understand
+what error handling is expected of Processors. Here, we will discuss how Processors should
+handle unexpected errors during the course of their work.</p>
+</div>
+<div class="sect2">
+<h3 id="exceptions-within-the-processor"><a class="anchor" href="#exceptions-within-the-processor"></a>Exceptions within the Processor</h3>
+<div class="paragraph">
+<p>During the execution of the <code>onTrigger</code> method of a Processor, many things can potentially go
+awry. Common failure conditions include:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>Incoming data is not in the expected format.</p>
+</li>
+<li>
+<p>Network connections to external services fail.</p>
+</li>
+<li>
+<p>Reading or writing data to a disk fails.</p>
+</li>
+<li>
+<p>There is a bug in the Processor or a dependent library.</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>Any of these conditions can result in an Exception being thrown from the Processor. From the framework
+perspective, there are two types of Exceptions that can escape a Processor: <code>ProcessException</code> and
+all others.</p>
+</div>
+<div class="paragraph">
+<p>If a ProcessException is thrown from the Processor, the framework will assume that this is a failure that
+is a known outcome. Moreover, it is a condition where attempting to process the data again later may
+be successful. As a result, the framework will roll back the session that was being processed and penalize
+the FlowFiles that were being processed.</p>
+</div>
+<div class="paragraph">
+<p>If any other Exception escapes the Processor, though, the framework will assume that it is a failure that
+was not taken into account by the developer. In this case, the framework will also roll back the session
+and penalize the FlowFiles. However, this situation can become very problematic. For example,
+the Processor may be in a bad state and may continually run, depleting system resources, without performing
+any useful work. This is fairly common, for instance, when a NullPointerException is thrown continually.
+In order to avoid this case, if an Exception other than ProcessException is able to escape the Processor&#8217;s
+<code>onTrigger</code> method, the framework will also "Administratively Yield" the Processor. This means that the
+Processor will not be triggered to run again for some amount of time. The amount of time is configured
+in the <code>nifi.properties</code> file but is 10 seconds by default.</p>
+</div>
+</div>
+<div class="sect2">
+<h3 id="exceptions-within-a-callback-ioexception-runtimeexception"><a class="anchor" href="#exceptions-within-a-callback-ioexception-runtimeexception"></a>Exceptions within a callback: IOException, RuntimeException</h3>
+<div class="paragraph">
+<p>More often than not, when an Exception occurs in a Processor, it occurs from within a callback (i.e.,
+<code>InputStreamCallback</code>, <code>OutputStreamCallback</code>, or <code>StreamCallback</code>), that is, during the processing of a
+FlowFile&#8217;s content. Callbacks are allowed to throw either <code>RuntimeException</code> or <code>IOException</code>. In the case
+of RuntimeException, this Exception will propagate back to the <code>onTrigger</code> method. In the case of an
+<code>IOException</code>, the Exception will be wrapped within a ProcessException and this ProcessException will then
+be thrown from the Framework.</p>
+</div>
+<div class="paragraph">
+<p>For this reason, it is recommended that Processors that use callbacks do so within a <code>try/catch</code> block
+and catch <code>ProcessException</code> as well as any other <code>RuntimeException</code> that they expect their callback to
+throw. It is <strong>not</strong> recommended that Processors catch the general <code>Exception</code> or <code>Throwable</code> cases, however.
+This is discouraged for two reasons.</p>
+</div>
+<div class="paragraph">
+<p>First, if an unexpected RuntimeException is thrown, it is likely a bug.
+Allowing the framework to roll back the session ensures that no data is lost and that DataFlow Managers
+are able to deal with the data as they see fit by keeping the data queued up in place.</p>
+</div>
+<div class="paragraph">
+<p>Second, when an IOException is thrown from a callback, there really are two types of IOExceptions: those thrown
+from Processor code (for example, the data is not in the expected format or a network connection fails), and
+those that are thrown from the Content Repository (where the FlowFile content is stored). If the latter is the case,
+the framework will catch this IOException and wrap it into a <code>FlowFileAccessException</code>, which extends <code>RuntimeException</code>.
+This is done explicitly so that the Exception will escape the <code>onTrigger</code> method and the framework can handle this
+condition appropriately. Catching the general Exception prevents this from happening.</p>
+</div>
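+<div class="paragraph">
+<p>The recommendation above can be illustrated with a self-contained sketch. The two exception
+classes below are hypothetical stand-ins that share names with the NiFi types so that the example
+compiles without the framework, and <code>CallbackHandling</code> is likewise invented for illustration. The
+point is only the shape of the <code>try/catch</code>: the expected failure is caught, and everything else is
+allowed to escape.</p>
+</div>

```java
// Hypothetical stand-ins for the NiFi exception types so that the sketch compiles on its own.
class ProcessException extends RuntimeException {
    ProcessException(String message, Throwable cause) {
        super(message, cause);
    }
}

class FlowFileAccessException extends RuntimeException {
    FlowFileAccessException(String message) {
        super(message);
    }
}

class CallbackHandling {
    // Runs a unit of work the way a Processor might wrap a call that takes a callback.
    // Returns the name of the relationship the FlowFile would be routed to.
    static String process(Runnable work) {
        try {
            work.run();
            return "success";
        } catch (ProcessException e) {
            // Expected failure (for example, malformed input): route to failure.
            return "failure";
        }
        // Deliberately no catch for Exception or Throwable: repository-level errors
        // must escape so that the caller (the framework, in NiFi) can roll back.
    }
}
```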
+</div>
+<div class="sect2">
+<h3 id="penalization-vs-yielding"><a class="anchor" href="#penalization-vs-yielding"></a>Penalization vs. Yielding</h3>
+<div class="paragraph">
+<p>When an issue occurs during processing, the framework exposes two methods to allow Processor developers to avoid performing
+unnecessary work: "penalization" and "yielding." These two concepts can become confusing for developers new to the NiFi API.
+A developer is able to penalize a FlowFile by calling the <code>penalize(FlowFile)</code> method of ProcessSession. This causes the
+FlowFile itself to be inaccessible to downstream Processors for a period of time. The amount of time that the FlowFile is
+inaccessible is determined by the DataFlow Manager by setting the "Penalty Duration" setting in the Processor Configuration
+dialog. The default value is 30 seconds. Typically, this is done when a Processor determines that the data cannot be processed
+due to environmental reasons that are expected to sort themselves out. A great example of this is the PutSFTP processor, which
+will penalize a FlowFile if a file already exists on the SFTP server that has the same filename. In this case, the Processor
+penalizes the FlowFile and routes it to failure. A DataFlow Manager can then route failure back to the same PutSFTP Processor.
+This way, if a file exists with the same filename, the Processor will not attempt to send the file again for 30 seconds
+(or whatever period the DFM has configured the Processor to use). In the meantime, it is able to continue to process other
+FlowFiles.</p>
+</div>
+<div class="paragraph">
+<p>On the other hand, yielding allows a Processor developer to indicate to the framework that it will not be able to perform
+any useful function for some period of time. This commonly happens with a Processor that is communicating with a remote
+resource. If the Processor cannot connect to the remote resource, or if the remote resource is expected to provide data
+but reports that it has none, the Processor should call <code>yield</code> on the <code>ProcessContext</code> object and then return. By doing
+this, the Processor is telling the framework that it should not waste resources triggering this Processor to run, because
+there&#8217;s nothing that it can do - it&#8217;s better to use those resources to allow other Processors to run.</p>
+</div>
+</div>
+<div class="sect2">
+<h3 id="session-rollback"><a class="anchor" href="#session-rollback"></a>Session Rollback</h3>
+<div class="paragraph">
+<p>Thus far, when we have discussed the <code>ProcessSession</code>, we have typically referred to it simply as a mechanism for accessing
+FlowFiles. However, it provides another very important capability, which is transactionality. All methods that are called
+on a ProcessSession happen as a transaction. When we decide to end the transaction, we can do so either by calling
+<code>commit()</code> or by calling <code>rollback()</code>. Typically, this is handled by the <code>AbstractProcessor</code> class: if the <code>onTrigger</code> method
+throws an Exception, the AbstractProcessor will catch the Exception, call <code>session.rollback()</code>, and then re-throw the Exception.
+Otherwise, the AbstractProcessor will call <code>commit()</code> on the ProcessSession.</p>
+</div>
+<div class="paragraph">
+<p>There are times, however, that developers will want to roll back a session explicitly. This can be accomplished at any time
+by calling the <code>rollback()</code> or <code>rollback(boolean)</code> method. If using the latter, the boolean indicates whether or not those
+FlowFiles that have been pulled from queues (via the ProcessSession <code>get</code> methods) should be penalized before being added
+back to their queues.</p>
+</div>
+<div class="paragraph">
+<p>When <code>rollback</code> is called, any modifications that have occurred to the FlowFiles in that session are discarded, including
+both content modification and attribute modification. Additionally, all Provenance Events are rolled back (with the exception
+of any SEND event that was emitted by passing a value of <code>true</code> for the <code>force</code> argument). The FlowFiles that were pulled from
+the input queues are then transferred back to the input queues (and optionally penalized) so that they can be processed again.</p>
+</div>
+<div class="paragraph">
+<p>On the other hand, when the <code>commit</code> method is called, the FlowFile&#8217;s new state is persisted in the FlowFile Repository, and
+any Provenance Events that occurred are persisted in the Provenance Repository. The previous content is destroyed (unless
+another FlowFile references the same piece of content), and the FlowFiles are transferred to the outbound queues so that the
+next Processors can operate on the data.</p>
+</div>
+<div class="paragraph">
+<p>It is also important to note how this behavior is affected by using the <code>org.apache.nifi.annotation.behavior.SupportsBatching</code>
+annotation. If a Processor utilizes this annotation, calls to <code>ProcessSession.commit</code> may not take effect immediately. Rather,
+these commits may be batched together in order to provide higher throughput. However, if at any point, the Processor rolls back
+the ProcessSession, all changes since the last call to <code>commit</code> will be discarded and all "batched" commits will take effect.
+These "batched" commits are not rolled back.</p>
+</div>
+</div>
+</div>
+</div>
+<div class="sect1">
+<h2 id="general-design-considerations"><a class="anchor" href="#general-design-considerations"></a>General Design Considerations</h2>
 <div class="sectionbody">
 <div class="paragraph">
-<p>Processor, Prioritizer, &#8230;&#8203;</p>

[... 796 lines stripped ...]

