falcon-commits mailing list archives

From srik...@apache.org
Subject svn commit: r1624488 [2/7] - in /incubator/falcon: site/ site/0.3-incubating/ site/0.4-incubating/ site/docs/ site/docs/restapi/ trunk/ trunk/general/src/site/twiki/docs/ trunk/general/src/site/twiki/docs/restapi/
Date Fri, 12 Sep 2014 09:43:51 GMT
Modified: incubator/falcon/site/docs/EntitySpecification.html
URL: http://svn.apache.org/viewvc/incubator/falcon/site/docs/EntitySpecification.html?rev=1624488&r1=1624487&r2=1624488&view=diff
==============================================================================
--- incubator/falcon/site/docs/EntitySpecification.html (original)
+++ incubator/falcon/site/docs/EntitySpecification.html Fri Sep 12 09:43:48 2014
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia at 2014-07-05
+ | Generated by Apache Maven Doxia at 2014-09-12
  | Rendered using Apache Maven Fluido Skin 1.3.0
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20140705" />
+    <meta name="Date-Revision-yyyymmdd" content="20140912" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Falcon - Contents</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
@@ -239,7 +239,7 @@
         
                 
                     
-                  <li id="publishDate" class="pull-right">Last Published: 2014-07-05</li> 
+                  <li id="publishDate" class="pull-right">Last Published: 2014-09-12</li> 
             
                             </ul>
       </div>
@@ -257,7 +257,7 @@
 <li><a href="#Process_Specification">Process Specification</a></li></ul></div>
 <div class="section">
 <h3>Cluster Specification<a name="Cluster_Specification"></a></h3>
-<p>The <a class="externalLink" href="https://git-wip-us.apache.org/repos/asf?p=incubator-falcon.git;a=blob_plain;f=client/src/main/resources/cluster-0.1.xsd;hb=HEAD">Cluster XSD specification</a> is available here: A cluster contains different interfaces which are used by Falcon like readonly, write, workflow and messaging. A cluster is referenced by feeds and processes which are on-boarded to Falcon by its name.</p>
+<p>The cluster XSD specification is available here. A cluster contains different interfaces which are used by Falcon, such as readonly, write, workflow and messaging. A cluster is referenced by name by the feeds and processes which are on-boarded to Falcon.</p>
 <p>Following are the tags defined in a cluster.xml:</p>
 <div class="source">
 <pre>
@@ -265,8 +265,10 @@
  xmlns:xsi=&quot;http://www.w3.org/2001/XMLSchema-instance&quot;&gt;
 
 </pre></div>
-<p>The colo specifies the colo to which this cluster belongs to and name is the name of the cluster which has to  be unique.</p>
-<p>A cluster has varies interfaces as described below:</p>
+<p>The colo specifies the colo to which this cluster belongs, and name is the name of the cluster, which has to be unique.</p></div>
+<div class="section">
+<h4>Interfaces<a name="Interfaces"></a></h4>
+<p>A cluster has various interfaces as described below:</p>
 <div class="source">
 <pre>
     &lt;interface type=&quot;readonly&quot; endpoint=&quot;hftp://localhost:50010&quot; version=&quot;0.20.2&quot; /&gt;
@@ -284,7 +286,7 @@
 &lt;interface type=&quot;execute&quot; endpoint=&quot;localhost:8021&quot; version=&quot;0.20.2&quot; /&gt;
 
 </pre></div>
-<p>An execute interface specifies the interface for job tracker, it's endpoint is the value of mapred.job.tracker.  Falcon uses this interface to submit the processes as jobs on <a href="./JobTracker.html">JobTracker</a> defined here.</p>
+<p>An execute interface specifies the interface for the job tracker; its endpoint is the value of mapred.job.tracker. Falcon uses this interface to submit the processes as jobs on the JobTracker defined here.</p>
 <div class="source">
 <pre>
 &lt;interface type=&quot;workflow&quot; endpoint=&quot;http://localhost:11000/oozie/&quot; version=&quot;3.1&quot; /&gt;
@@ -302,14 +304,28 @@
 &lt;interface type=&quot;messaging&quot; endpoint=&quot;tcp://localhost:61616?daemon=true&quot; version=&quot;5.4.6&quot; /&gt;
 
 </pre></div>
-<p>A messaging interface specifies the interface for sending feed availability messages, it's endpoint is broker url with tcp address.</p>
+<p>A messaging interface specifies the interface for sending feed availability messages; its endpoint is the broker url with a tcp address.</p></div>
+<div class="section">
+<h4>Locations<a name="Locations"></a></h4>
 <p>A cluster has a list of locations defined:</p>
 <div class="source">
 <pre>
 &lt;location name=&quot;staging&quot; path=&quot;/projects/falcon/staging&quot; /&gt;
+&lt;location name=&quot;working&quot; path=&quot;/projects/falcon/working&quot; /&gt;
+
+</pre></div>
+<p>Location has the name and the path. name is the type of location, such as staging, temp or working, and path is the hdfs path for each location. Falcon would use these locations to do intermediate processing of entities in hdfs and hence Falcon should have read/write/execute permission on these locations.</p></div>
+<div class="section">
+<h4>ACL<a name="ACL"></a></h4>
+<p>A cluster has an ACL (Access Control List) which is useful for implementing permission requirements and provides a way to set different permissions for specific users or named groups.</p>
+<div class="source">
+<pre>
+    &lt;ACL owner=&quot;test-user&quot; group=&quot;test-group&quot; permission=&quot;*&quot;/&gt;
 
 </pre></div>
-<p>Location has the name and the path, name is the type of locations like staging, temp and working. and path is the hdfs path for each location. Falcon would use the location to do intermediate processing of entities in hdfs and hence Falcon should have read/write/execute permission on these locations.</p>
+<p>ACL indicates the Access Control List for this cluster. owner is the owner of this entity, group is the group which has access to read, and permission indicates the permission.</p></div>
+<div class="section">
+<h4>Custom Properties<a name="Custom_Properties"></a></h4>
 <p>A cluster has a list of properties: key-value pairs, which are propagated to the workflow engine.</p>
 <div class="source">
 <pre>
@@ -319,7 +335,7 @@
 <p>Ideally, the JMS implementation class name of the messaging engine (brokerImplClass) should be defined here.</p></div>
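+<p>Putting the pieces above together, a minimal cluster.xml might look like the following sketch; the colo, cluster name, endpoints and property values are illustrative only and should be replaced with values for your deployment:</p>
+<div class="source">
+<pre>
+&lt;!-- Illustrative sketch only; names, endpoints and values are placeholders --&gt;
+&lt;cluster colo=&quot;default&quot; name=&quot;corp-cluster&quot; xmlns=&quot;uri:falcon:cluster:0.1&quot;&gt;
+    &lt;interfaces&gt;
+        &lt;interface type=&quot;readonly&quot; endpoint=&quot;hftp://localhost:50010&quot; version=&quot;0.20.2&quot;/&gt;
+        &lt;interface type=&quot;write&quot; endpoint=&quot;hdfs://localhost:8020&quot; version=&quot;0.20.2&quot;/&gt;
+        &lt;interface type=&quot;execute&quot; endpoint=&quot;localhost:8021&quot; version=&quot;0.20.2&quot;/&gt;
+        &lt;interface type=&quot;workflow&quot; endpoint=&quot;http://localhost:11000/oozie/&quot; version=&quot;3.1&quot;/&gt;
+        &lt;interface type=&quot;messaging&quot; endpoint=&quot;tcp://localhost:61616?daemon=true&quot; version=&quot;5.4.6&quot;/&gt;
+    &lt;/interfaces&gt;
+    &lt;locations&gt;
+        &lt;location name=&quot;staging&quot; path=&quot;/projects/falcon/staging&quot;/&gt;
+        &lt;location name=&quot;working&quot; path=&quot;/projects/falcon/working&quot;/&gt;
+    &lt;/locations&gt;
+    &lt;ACL owner=&quot;test-user&quot; group=&quot;test-group&quot; permission=&quot;*&quot;/&gt;
+    &lt;properties&gt;
+        &lt;property name=&quot;brokerImplClass&quot; value=&quot;org.apache.activemq.ActiveMQConnectionFactory&quot;/&gt;
+    &lt;/properties&gt;
+&lt;/cluster&gt;
+
+</pre></div>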
 <div class="section">
 <h3>Feed Specification<a name="Feed_Specification"></a></h3>
-<p>The <a class="externalLink" href="https://git-wip-us.apache.org/repos/asf?p=incubator-falcon.git;a=blob_plain;f=client/src/main/resources/feed-0.1.xsd;hb=HEAD">Feed XSD specification</a> is available here. a Feed defines various attributes of feed like feed location, frequency, late-arrival handling and retention policies. A feed can be scheduled on a cluster, once a feed is scheduled its retention and replication process are triggered in a given cluster.</p>
+<p>The Feed XSD specification is available here. A feed defines various attributes such as the feed location, frequency, late-arrival handling and retention policies. A feed can be scheduled on a cluster; once a feed is scheduled, its retention and replication processes are triggered in a given cluster.</p>
 <div class="source">
 <pre>
 &lt;feed description=&quot;clicks log&quot; name=&quot;clicks&quot; xmlns=&quot;uri:falcon:feed:0.1&quot;
@@ -364,7 +380,7 @@ xmlns:xsi=&quot;http://www.w3.org/2001/X
  &lt;location type=&quot;meta&quot; path=&quot;/projects/falcon/clicksMetaData&quot; /&gt;
 
 </pre></div>
-<p>A location tag specifies the type of location like data, meta, stats and the corresponding paths for them. A feed should at least define the location for type data, which specifies the HDFS path pattern where the feed is generated periodically. ex: type=&quot;data&quot; path=&quot;/projects/TrafficHourly/${YEAR}-${MONTH}-${DAY}/traffic&quot; The granularity of date pattern in the path should be atleast that of a frequency of a feed. Other location type which are supported are stats and meta paths, if a process references a feed then the meta and stats paths are available as a property in a process.</p></div>
+<p>A location tag specifies the type of location, such as data, meta or stats, and the corresponding paths for them. A feed should at least define the location for type data, which specifies the HDFS path pattern where the feed is generated periodically. ex: type=&quot;data&quot; path=&quot;/projects/TrafficHourly/${YEAR}-${MONTH}-${DAY}/traffic&quot; The granularity of the date pattern in the path should be at least that of the frequency of the feed. The other location types which are supported are stats and meta paths; if a process references a feed then the meta and stats paths are available as a property in the process.</p></div>
 <div class="section">
 <h5>Catalog Storage (Table)<a name="Catalog_Storage_Table"></a></h5>
 <p>A table tag specifies the table URI in the catalog registry as:</p>
@@ -408,7 +424,7 @@ catalog:$database-name:$table-name#parti
 
 </pre></div>
 <p>A feed can define multiple partitions; if a referenced cluster defines partitions then the number of partitions in the feed has to be equal to or more than the cluster partitions.</p>
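+<p>For illustration, a feed declaring two partitions might look like the following sketch (the partition names are placeholders):</p>
+<div class="source">
+<pre>
+    &lt;partitions&gt;
+        &lt;partition name=&quot;country&quot;/&gt;
+        &lt;partition name=&quot;cluster&quot;/&gt;
+    &lt;/partitions&gt;
+
+</pre></div>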
-<p><b>Note:</b> This will only apply for <a href="./FileSystem.html">FileSystem</a> storage but not Table storage as partitions are defined and maintained in Hive (Hcatalog) registry.</p></div>
+<p><b>Note:</b> This will only apply for FileSystem storage but not Table storage as partitions are defined and maintained in Hive (HCatalog) registry.</p></div>
 <div class="section">
 <h4>Groups<a name="Groups"></a></h4>
 <div class="source">
@@ -441,9 +457,18 @@ catalog:$database-name:$table-name#parti
 
 </pre></div>
 <p>A late-arrival specifies the cut-off period till which the feed is expected to arrive late and should be honored by processes referring to it as an input feed, by rerunning the instances in case the data arrives late within the cut-off period. The cut-off period is specified by the expression frequency(times), ex: if the feed can arrive late up to 8 hours then late-arrival's cut-off=&quot;hours(8)&quot;</p>
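+<p>That is, a late-arrival declaration with an 8 hour cut-off might look like this sketch:</p>
+<div class="source">
+<pre>
+    &lt;late-arrival cut-off=&quot;hours(8)&quot;/&gt;
+
+</pre></div>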
-<p><b>Note:</b> This will only apply for <a href="./FileSystem.html">FileSystem</a> storage but not Table storage until a future time.</p></div>
+<p><b>Note:</b> This will only apply for FileSystem storage but not Table storage until a future time.</p></div>
 <div class="section">
-<h5>Custom Properties<a name="Custom_Properties"></a></h5>
+<h4>ACL<a name="ACL"></a></h4>
+<p>A feed has an ACL (Access Control List) which is useful for implementing permission requirements and provides a way to set different permissions for specific users or named groups.</p>
+<div class="source">
+<pre>
+    &lt;ACL owner=&quot;test-user&quot; group=&quot;test-group&quot; permission=&quot;*&quot;/&gt;
+
+</pre></div>
+<p>ACL indicates the Access Control List for this feed. owner is the owner of this entity, group is the group which has access to read, and permission indicates the permission.</p></div>
+<div class="section">
+<h4>Custom Properties<a name="Custom_Properties"></a></h4>
 <div class="source">
 <pre>
     &lt;properties&gt;
@@ -453,17 +478,18 @@ catalog:$database-name:$table-name#parti
         &lt;property name=&quot;jobPriority&quot; value=&quot;VERY_HIGH&quot;/&gt;
         &lt;property name=&quot;timeout&quot; value=&quot;hours(1)&quot;/&gt;
         &lt;property name=&quot;parallel&quot; value=&quot;3&quot;/&gt;
+        &lt;property name=&quot;maxMaps&quot; value=&quot;8&quot;/&gt;
+        &lt;property name=&quot;mapBandwidthKB&quot; value=&quot;1024&quot;/&gt;
     &lt;/properties&gt;
 
 </pre></div>
-<p>A key-value pair, which are propagated to the workflow engine. &quot;queueName&quot; and &quot;jobPriority&quot; are special properties available to user to specify the hadoop job queue and priority, the same value is used by Falcons launcher job. &quot;timeout&quot; and &quot;parallel&quot; are other special properties which decides replication instance's timeout value while waiting for the feed instance and parallel decides the concurrent replication instances that can run at any given time.</p></div>
+<p>Key-value pairs, which are propagated to the workflow engine. &quot;queueName&quot; and &quot;jobPriority&quot; are special properties available to the user to specify the Hadoop job queue and priority; the same values are used by Falcon's launcher job. &quot;timeout&quot; and &quot;parallel&quot; are other special properties: timeout decides the replication instance's timeout value while waiting for the feed instance, and parallel decides the concurrent replication instances that can run at any given time. &quot;maxMaps&quot; represents the maximum number of maps used during replication. &quot;mapBandwidthKB&quot; represents the bandwidth in KB/s used by each mapper during replication.</p></div>
 <div class="section">
 <h3>Process Specification<a name="Process_Specification"></a></h3>
-<p>The <a class="externalLink" href="https://git-wip-us.apache.org/repos/asf?p=incubator-falcon.git;a=blob_plain;f=client/src/main/resources/process-0.1.xsd;hb=HEAD">Process XSD specification</a> is available here.</p>
 <p>A process defines configuration for a workflow. A workflow is a directed acyclic graph(DAG) which defines the job for the workflow engine. A process definition defines  the configurations required to run the workflow job. For example, process defines the frequency at which the workflow should run, the clusters on which the workflow should run, the inputs and outputs for the workflow, how the workflow failures should be handled, how the late inputs should be handled and so on.</p>
 <p>The different details of process are:</p></div>
 <div class="section">
-<h5>Name<a name="Name"></a></h5>
+<h4>Name<a name="Name"></a></h4>
 <p>Each process is identified with a unique name. Syntax:</p>
 <div class="source">
 <pre>
@@ -473,7 +499,25 @@ catalog:$database-name:$table-name#parti
 
 </pre></div></div>
 <div class="section">
-<h5>Cluster<a name="Cluster"></a></h5>
+<h4>Tags<a name="Tags"></a></h4>
+<p>An optional list of comma separated tags which are used for classification of processes. Syntax:</p>
+<div class="source">
+<pre>
+...
+    &lt;tags&gt;consumer=consumer@xyz.com, owner=producer@xyz.com, department=forecasting&lt;/tags&gt;
+
+</pre></div></div>
+<div class="section">
+<h4>Pipelines<a name="Pipelines"></a></h4>
+<p>An optional list of comma separated word strings which specifies the data processing pipeline(s) to which this process belongs. Only letters, numbers and underscores are allowed in a pipeline string. Syntax:</p>
+<div class="source">
+<pre>
+...
+    &lt;pipelines&gt;test_Pipeline, dataReplication, clickStream_pipeline&lt;/pipelines&gt;
+
+</pre></div></div>
+<div class="section">
+<h4>Cluster<a name="Cluster"></a></h4>
 <p>The cluster on which the workflow should run. A process should contain one or more clusters. The cluster definition for the cluster name gives the end points for workflow execution, name node, job tracker, messaging and so on. Each cluster in turn has a validity mentioned, which tells the times between which the job should run on that specified cluster. Syntax:</p>
 <div class="source">
 <pre>
@@ -495,19 +539,19 @@ catalog:$database-name:$table-name#parti
 
 </pre></div></div>
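+<p>For illustration, a clusters block with a single cluster and a validity window might look like the following sketch (the cluster name and timestamps are placeholders; see the Validity section below for the timestamp format):</p>
+<div class="source">
+<pre>
+&lt;process name=&quot;[process name]&quot;&gt;
+...
+   &lt;clusters&gt;
+        &lt;cluster name=&quot;test-cluster1&quot;&gt;
+            &lt;validity start=&quot;2012-04-03T00:00Z&quot; end=&quot;2012-12-30T00:00Z&quot;/&gt;
+        &lt;/cluster&gt;
+   &lt;/clusters&gt;
+...
+&lt;/process&gt;
+
+</pre></div>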
 <div class="section">
-<h5>Parallel<a name="Parallel"></a></h5>
-<p>Parallel defines how many instances of the workflow can run concurrently. It should be a positive interger &gt; 0. For example, concurrency of 1 ensures that only one instance of the workflow can run at a time. The next instance will start only after the running instance completes. Syntax:</p>
+<h4>Parallel<a name="Parallel"></a></h4>
+<p>Parallel defines how many instances of the workflow can run concurrently. It should be a positive integer &gt; 0. For example, parallel of 1 ensures that only one instance of the workflow can run at a time. The next instance will start only after the running instance completes. Syntax:</p>
 <div class="source">
 <pre>
 &lt;process name=&quot;[process name]&quot;&gt;
 ...
-   &lt;concurrency&gt;[concurrency]&lt;/concurrency&gt;
+   &lt;parallel&gt;[parallel]&lt;/parallel&gt;
 ...
 &lt;/process&gt;
 
 </pre></div></div>
 <div class="section">
-<h5>Order<a name="Order"></a></h5>
+<h4>Order<a name="Order"></a></h4>
 <p>Order defines the order in which the ready instances are picked up. The possible values are FIFO(First In First Out), LIFO(Last In First Out), and ONLYLAST(Last Only). Syntax:</p>
 <div class="source">
 <pre>
@@ -519,7 +563,7 @@ catalog:$database-name:$table-name#parti
 
 </pre></div></div>
 <div class="section">
-<h5>Timeout<a name="Timeout"></a></h5>
+<h4>Timeout<a name="Timeout"></a></h4>
 <p>An optional timeout specifies the maximum time an instance waits for a dataset before being killed by the workflow engine. A timeout is specified like a frequency. If timeout is not specified, falcon computes a default timeout for a process based on its frequency, which is six times the frequency of the process, or 30 minutes if the computed timeout is less than 30 minutes.</p>
 <div class="source">
 <pre>
@@ -531,7 +575,7 @@ catalog:$database-name:$table-name#parti
 
 </pre></div></div>
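+<p>For example, a process might declare an explicit timeout as in the sketch below. If it were omitted, a process with frequency minutes(10) would get a default timeout of 60 minutes (6 x 10), while a process with frequency minutes(3) would get 30 minutes, since the computed 18 minutes is below the 30 minute floor:</p>
+<div class="source">
+<pre>
+&lt;process name=&quot;[process name]&quot;&gt;
+...
+   &lt;timeout&gt;hours(1)&lt;/timeout&gt;
+...
+&lt;/process&gt;
+
+</pre></div>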
 <div class="section">
-<h5>Frequency<a name="Frequency"></a></h5>
+<h4>Frequency<a name="Frequency"></a></h4>
 <p>Frequency defines how frequently the workflow job should run. For example, hours(1) defines the frequency as hourly, days(7) defines weekly frequency. The values for timeunit can be minutes/hours/days/months and the frequency number should be a positive integer &gt; 0.  Syntax:</p>
 <div class="source">
 <pre>
@@ -543,7 +587,7 @@ catalog:$database-name:$table-name#parti
 
 </pre></div></div>
 <div class="section">
-<h5>Validity<a name="Validity"></a></h5>
+<h4>Validity<a name="Validity"></a></h4>
 <p>Validity defines how long the workflow should run. It has 3 components - start time, end time and timezone. Start time and end time are timestamps defined in yyyy-MM-dd'T'HH:mm'Z' format and should always be in UTC. Timezone is used to compute the next instances starting from start time. The workflow will start at start time and end before end time specified on a given cluster. So, there will not be a workflow instance at end time. Syntax:</p>
 <div class="source">
 <pre>
@@ -578,7 +622,7 @@ catalog:$database-name:$table-name#parti
 </pre></div>
 <p>The hourly workflow will start on March 11th 2012 at 00:40 PST, the next instances will be at 01:40 PST, 03:40 PDT, 04:40 PDT and so on till 23:40 PDT. So, there will be just 23 instances of the workflow for March 11th 2012 because of DST switch.</p></div>
 <div class="section">
-<h5>Inputs<a name="Inputs"></a></h5>
+<h4>Inputs<a name="Inputs"></a></h4>
 <p>Inputs define the input data for the workflow. The workflow job will start executing only after the schedule time and when all the inputs are available. There can be 0 or more inputs and each of the input maps to a feed. The path and frequency of input data is picked up from feed definition. Each input should also define start and end instances in terms of <a href="./FalconDocumentation.html">EL expressions</a> and can optionally specify specific partition of input that the workflow requires. The components in partition should be subset of partitions defined in the feed.</p>
 <p>For each input, Falcon will create a property with the input name that contains the comma separated list of input paths. This property can be used in workflow actions like pig scripts and so on.</p>
 <p>Syntax:</p>
@@ -675,7 +719,7 @@ catalog:$database-name:$table-name#parti
 
 </pre></div></div>
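+<p>For illustration, an input referencing a feed named &quot;clicks&quot; might look like the following sketch; the feed name, EL instances and the optional partition are placeholders:</p>
+<div class="source">
+<pre>
+&lt;process name=&quot;[process name]&quot;&gt;
+...
+    &lt;inputs&gt;
+        &lt;input name=&quot;inputData&quot; feed=&quot;clicks&quot; start=&quot;now(0,-60)&quot; end=&quot;now(0,20)&quot; partition=&quot;*/US&quot;/&gt;
+    &lt;/inputs&gt;
+...
+&lt;/process&gt;
+
+</pre></div>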
 <div class="section">
-<h5>Optional Inputs<a name="Optional_Inputs"></a></h5>
+<h4>Optional Inputs<a name="Optional_Inputs"></a></h4>
 <p>Users can mention one or more inputs as optional inputs. In such cases the job does not wait on those inputs which are mentioned as optional. If they are present it considers them, otherwise it continues with the compulsory ones. Example:</p>
 <div class="source">
 <pre>
@@ -701,9 +745,9 @@ catalog:$database-name:$table-name#parti
 &lt;/process&gt;
 
 </pre></div>
-<p><b>Note:</b> This is only supported for <a href="./FileSystem.html">FileSystem</a> storage but not Table storage at this point.</p></div>
+<p><b>Note:</b> This is only supported for FileSystem storage but not Table storage at this point.</p></div>
 <div class="section">
-<h5>Outputs<a name="Outputs"></a></h5>
+<h4>Outputs<a name="Outputs"></a></h4>
 <p>Outputs define the output data that is generated by the workflow. A process can define 0 or more outputs. Each output is mapped to a feed and the output path is picked up from feed definition. The output instance that should be generated is specified in terms of <a href="./FalconDocumentation.html">EL expression</a>.</p>
 <p>For each output, Falcon creates a property with output name that contains the path of output data. This can be used in workflows to store in the path. Syntax:</p>
 <div class="source">
@@ -787,21 +831,21 @@ catalog:$database-name:$table-name#parti
 
 </pre></div></div>
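+<p>For illustration, an output mapped to a feed named &quot;clicks-summary&quot; for the current instance might look like this sketch (the feed name and EL instance are placeholders):</p>
+<div class="source">
+<pre>
+&lt;process name=&quot;[process name]&quot;&gt;
+...
+    &lt;outputs&gt;
+        &lt;output name=&quot;outputData&quot; feed=&quot;clicks-summary&quot; instance=&quot;now(0,0)&quot;/&gt;
+    &lt;/outputs&gt;
+...
+&lt;/process&gt;
+
+</pre></div>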
 <div class="section">
-<h5>Properties<a name="Properties"></a></h5>
-<p>The properties are key value pairs that are passed to the workflow. These properties are optional and can be used in workflow to parameterize the workflow. Synatx:</p>
+<h4>Custom Properties<a name="Custom_Properties"></a></h4>
+<p>The properties are key value pairs that are passed to the workflow. These properties are optional and can be used in workflow to parameterize the workflow. Syntax:</p>
 <div class="source">
 <pre>
 &lt;process name=&quot;[process name]&quot;&gt;
 ...
     &lt;properties&gt;
-        &lt;propery name=[key] value=[value]/&gt;
+        &lt;property name=[key] value=[value]/&gt;
         ...
     &lt;/properties&gt;
 ...
 &lt;/process&gt;
 
 </pre></div>
-<p>queueName and jobPriority are special properites, which when present are used by the Falcon's launcher job, the same property is also availalble in workflow which can be used to propogate to pig or M/R job.</p>
+<p>queueName and jobPriority are special properties, which when present are used by Falcon's launcher job; the same properties are also available in the workflow and can be used to propagate to the pig or M/R job.</p>
 <div class="source">
 <pre>
         &lt;property name=&quot;queueName&quot; value=&quot;hadoopQueue&quot;/&gt;
@@ -809,13 +853,13 @@ catalog:$database-name:$table-name#parti
 
 </pre></div></div>
 <div class="section">
-<h5>Workflow<a name="Workflow"></a></h5>
+<h4>Workflow<a name="Workflow"></a></h4>
 <p>The workflow defines the workflow engine that should be used and the path to the workflow on hdfs. The workflow definition on hdfs contains the actual job that should run and it should conform to the workflow specification of the engine specified. The libraries required by the workflow should be in the lib folder inside the workflow path.</p>
 <p>The properties defined in the cluster and cluster properties(nameNode and jobTracker) will also be available for the workflow.</p>
 <p>There are 2 engines supported today.</p></div>
 <div class="section">
-<h6>Oozie<a name="Oozie"></a></h6>
-<p>As part of oozie workflow engine support, users can embed a oozie workflow. Refer to oozie <a class="externalLink" href="http://incubator.apache.org/oozie/overview.html">workflow overview</a> and <a class="externalLink" href="http://incubator.apache.org/oozie/docs/3.1.3/docs/WorkflowFunctionalSpec.html">workflow specification</a> for details.</p>
+<h5>Oozie<a name="Oozie"></a></h5>
+<p>As part of oozie workflow engine support, users can embed an oozie workflow. Refer to the oozie <a class="externalLink" href="http://oozie.apache.org/docs/4.0.0/DG_Overview.html">workflow overview</a> and <a class="externalLink" href="http://oozie.apache.org/docs/4.0.0/WorkflowFunctionalSpec.html">workflow specification</a> for details.</p>
 <p>Syntax:</p>
 <div class="source">
 <pre>
@@ -838,7 +882,7 @@ catalog:$database-name:$table-name#parti
 </pre></div>
 <p>This defines the workflow engine to be oozie and the workflow xml is defined at /projects/bootcamp/workflow/workflow.xml. The libraries are at /projects/bootcamp/workflow/lib.</p></div>
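+<p>That is, a workflow declaration along the lines of the following sketch (the path is the illustrative one used above):</p>
+<div class="source">
+<pre>
+    &lt;workflow engine=&quot;oozie&quot; path=&quot;/projects/bootcamp/workflow&quot;/&gt;
+
+</pre></div>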
 <div class="section">
-<h6>Pig<a name="Pig"></a></h6>
+<h5>Pig<a name="Pig"></a></h5>
 <p>Falcon also adds the Pig engine which enables users to embed a Pig script as a process.</p>
 <p>Example:</p>
 <div class="source">
@@ -856,7 +900,7 @@ catalog:$database-name:$table-name#parti
 <pre>$input_filter
 </pre></div></div>
 <div class="section">
-<h6>Hive<a name="Hive"></a></h6>
+<h5>Hive<a name="Hive"></a></h5>
 <p>Falcon also adds the Hive engine as part of Hive Integration which enables users to embed a Hive script as a process. This would enable users to create materialized queries in a declarative way.</p>
 <p>Example:</p>
 <div class="source">
@@ -874,7 +918,7 @@ catalog:$database-name:$table-name#parti
 <pre>$input_filter
 </pre></div></div>
 <div class="section">
-<h5>Retry<a name="Retry"></a></h5>
+<h4>Retry<a name="Retry"></a></h4>
 <p>Retry policy defines how the workflow failures should be handled. Two retry policies are defined: backoff and exp-backoff(exponential backoff). Depending on the delay and number of attempts, the workflow is re-tried after specific intervals. Syntax:</p>
 <div class="source">
 <pre>
@@ -897,7 +941,7 @@ catalog:$database-name:$table-name#parti
 </pre></div>
 <p>The workflow is re-tried after 10 mins, 20 mins and 30 mins. With exponential backoff, the workflow will be re-tried after 10 mins, 20 mins and 40 mins.</p></div>
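+<p>The behaviour described above corresponds to a retry declaration along the lines of this sketch (attribute values are illustrative):</p>
+<div class="source">
+<pre>
+    &lt;retry policy=&quot;backoff&quot; delay=&quot;minutes(10)&quot; attempts=&quot;3&quot;/&gt;
+
+</pre></div>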
 <div class="section">
-<h5>Late data<a name="Late_data"></a></h5>
+<h4>Late data<a name="Late_data"></a></h4>
 <p>Late data handling defines how the late data should be handled. Each feed is defined with a late cut-off value which specifies the time till which late data is valid. For example, a late cut-off of hours(6) means that data for the nth hour can get delayed by up to 6 hours. Late data specification in the process defines how this late data is handled.</p>
 <p>Late data policy defines how frequently a check is done to detect late data. The policies supported are: backoff, exp-backoff(exponential backoff) and final(at feed's late cut-off). The policy along with delay defines the interval at which the late data check is done.</p>
 <p>Late input specification for each input defines the workflow that should run when late data is detected for that input.</p>
@@ -938,7 +982,16 @@ catalog:$database-name:$table-name#parti
 
 </pre></div>
 <p>This late handling specifies that late data detection should run at feed's late cut-off which is 6 hours in this case. If there is late data, Falcon should run the workflow specified at /projects/bootcamp/workflow/lateinput1/workflow.xml</p>
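+<p>The corresponding late-process declaration might look like the following sketch; the input name, policy attributes and workflow path are illustrative:</p>
+<div class="source">
+<pre>
+   &lt;late-process policy=&quot;final&quot;&gt;
+        &lt;late-input input=&quot;input1&quot; workflow-path=&quot;/projects/bootcamp/workflow/lateinput1&quot;/&gt;
+   &lt;/late-process&gt;
+
+</pre></div>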
-<p><b>Note:</b> This is only supported for <a href="./FileSystem.html">FileSystem</a> storage but not Table storage at this point.</p></div>
+<p><b>Note:</b> This is only supported for FileSystem storage but not Table storage at this point.</p></div>
+<div class="section">
+<h4>ACL<a name="ACL"></a></h4>
+<p>A process has an ACL (Access Control List) which is useful for implementing permission requirements and provides a way to set different permissions for specific users or named groups.</p>
+<div class="source">
+<pre>
+    &lt;ACL owner=&quot;test-user&quot; group=&quot;test-group&quot; permission=&quot;*&quot;/&gt;
+
+</pre></div>
+<p>ACL indicates the Access Control List for this process. owner is the owner of this entity, group is the group which has access to read, and permission indicates the permission.</p></div>
                   </div>
           </div>
 

Modified: incubator/falcon/site/docs/FalconArchitecture.html
URL: http://svn.apache.org/viewvc/incubator/falcon/site/docs/FalconArchitecture.html?rev=1624488&r1=1624487&r2=1624488&view=diff
==============================================================================
--- incubator/falcon/site/docs/FalconArchitecture.html (original)
+++ incubator/falcon/site/docs/FalconArchitecture.html Fri Sep 12 09:43:48 2014
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia at 2014-07-05
+ | Generated by Apache Maven Doxia at 2014-09-12
  | Rendered using Apache Maven Fluido Skin 1.3.0
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20140705" />
+    <meta name="Date-Revision-yyyymmdd" content="20140912" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Falcon - Contents</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
@@ -239,7 +239,7 @@
         
                 
                     
-                  <li id="publishDate" class="pull-right">Last Published: 2014-07-05</li> 
+                  <li id="publishDate" class="pull-right">Last Published: 2014-09-12</li> 
             
                             </ul>
       </div>
@@ -264,18 +264,20 @@
 <li><a href="#Handling_late_input_data">Handling late input data</a></li>
 <li><a href="#Idempotency">Idempotency</a></li>
 <li><a href="#Alerting_and_Monitoring">Alerting and Monitoring</a></li>
-<li><a href="#Falcon_EL_Expressions">Falcon EL Expressions</a></li></ul></div>
+<li><a href="#Falcon_EL_Expressions">Falcon EL Expressions</a></li>
+<li><a href="#Lineage">Lineage</a></li>
+<li><a href="#Security">Security</a></li></ul></div>
 <div class="section">
 <h3>Architecture<a name="Architecture"></a></h3></div>
 <div class="section">
 <h4>Introduction<a name="Introduction"></a></h4>
-<p>Falcon is a feed and process management platform over hadoop. Falcon essentially transforms user's feed and process configurations into repeated actions through a standard workflow engine (Apache Oozie). Falcon by itself doesn't do any heavy lifting. All the functions and workflow state management requirements are delegated to the workflow scheduler. The only thing that Falcon maintains is the dependencies and relationship between these entities. This is adequate to provide integrated and seamless experience to the developers using the falcon platform.</p></div>
+<p>Falcon is a feed and process management platform over hadoop. Falcon essentially transforms a user's feed and process configurations into repeated actions through a standard workflow engine. Falcon by itself doesn't do any heavy lifting. All the functions and workflow state management requirements are delegated to the workflow scheduler. The only things that Falcon maintains are the dependencies and relationships between these entities. This is adequate to provide an integrated and seamless experience to the developers using the falcon platform.</p></div>
 <div class="section">
 <h4>Falcon Architecture - Overview<a name="Falcon_Architecture_-_Overview"></a></h4>
 <p><img src="../images/Architecture.png" alt="" /></p></div>
 <div class="section">
 <h4>Scheduler<a name="Scheduler"></a></h4>
-<p>Falcon system has picked Apache Oozie as the default scheduler. However the system is open for integration with other schedulers. Lot of the data processing in hadoop requires scheduling to be based on both data availability as well as time. Apache Oozie currently supports these capabilities off the shelf and hence the choice.</p></div>
+<p>The Falcon system has picked Oozie as the default scheduler. However the system is open for integration with other schedulers. A lot of the data processing in hadoop requires scheduling to be based on both data availability as well as time. Oozie currently supports these capabilities off the shelf and hence the choice.</p></div>
 <div class="section">
 <h4>Control flow<a name="Control_flow"></a></h4>
 <p>Though the actual responsibility of the workflow is with the scheduler (Oozie), Falcon remains in the execution path by subscribing to messages that each of the workflows may generate. When Falcon generates a workflow in Oozie, it does so after instrumenting the workflow with additional steps which include messaging via JMS. The Falcon system itself subscribes to these control messages and can perform actions such as retries, handling late input arrival etc.</p></div>
@@ -293,10 +295,12 @@
 <p>Stand alone mode is useful when the hadoop jobs and relevant data processing involve only one hadoop cluster. In this mode there is a single Falcon server that contacts oozie to schedule jobs on Hadoop. All the process / feed requests like submit, schedule, suspend, kill are sent to this server only. For running in this mode one should use the falcon which has been built for standalone mode, or build using the standalone option if using source code.</p></div>
 <div class="section">
 <h4>Distributed Mode<a name="Distributed_Mode"></a></h4>
-<p>Distributed mode is the mode which you might me using most of the time. This is for orgs which have multiple instances of hadoop clusters, and multiple workflow schedulers to handle them. Here we have 2 components: Prism and Server. Both Prism and server have there own setup (runtime and startup properties) and there config locations.  In this mode Prism acts as a contact point for Falcon servers. Below are the requests that can be sent to prism and server in this mode:</p>
+<p>Distributed mode is the mode which you might be using most of the time. This is for organisations which have multiple instances of hadoop clusters, and multiple workflow schedulers to handle them. Here we have 2 components: Prism and Server. Both Prism and server have their own setup (runtime and startup properties) and their config locations. In this mode Prism acts as a contact point for Falcon servers. Below are the requests that can be sent to prism and server in this mode:</p>
 <p>Prism: submit, schedule, submitAndSchedule, Suspend, Resume, Kill, instance management  Server: schedule, suspend, resume, instance management</p>
 <p>As observed above submit and kill are kept exclusively as Prism operations to keep all the config stores in sync and to support feature of idempotency. Request may also be sent from prism but directed to a specific server using the option &quot;-colo&quot; from CLI or append the same in web request, if using API.</p>
-<p>When a cluster is submitted it is by default sent to all the servers configured in the prism. When is feed is SUBMIT / SCHEDULED request is only sent to the servers specified in the feed / process definitions. Servers are mentioned in the feed / process via CLUSTER tags in xml definition.</p></div>
+<p>When a cluster is submitted it is by default sent to all the servers configured in the prism. When a feed is SUBMITTED / SCHEDULED the request is only sent to the servers specified in the feed / process definitions. Servers are mentioned in the feed / process via CLUSTER tags in the xml definition.</p>
+<p>Communication between the prism and falcon server (for the submit/update entity function) is secured over https using a client-certificate based auth. The prism server needs to present a valid client certificate for the falcon server to accept the action.</p>
+<p>The startup property file in both the falcon &amp; prism servers needs to be configured with the following properties if TLS is enabled: keystore.file and keystore.password.</p></div>
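+<p>For example, a sketch of the relevant startup properties; the exact key prefix and values depend on your installation:</p>
+<div class="source">
+<pre>
+*.keystore.file=/path/to/keystore/file
+*.keystore.password=password
+
+</pre></div>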
 <div class="section">
 <h5>Prism Setup<a name="Prism_Setup"></a></h5>
 <p><img src="../images/PrismSetup.png" alt="" /></p></div>
@@ -307,7 +311,48 @@
 <h4>Atomic Actions<a name="Atomic_Actions"></a></h4>
 <p>Oftentimes when Falcon performs entity management actions, it may need to do several individual actions. If one of the actions were to fail, then the system could be in an inconsistent state. To avoid this, all individual operations performed are recorded into a transaction journal. This journal is then used to undo the overall user action. In some cases, it is not possible to undo the action. In such cases, Falcon attempts to keep the system in a consistent state.</p></div>
 <div class="section">
-<h3>Entity Management actions<a name="Entity_Management_actions"></a></h3></div>
+<h4>Storage<a name="Storage"></a></h4>
+<p>Falcon introduces a new abstraction to encapsulate the storage for a given feed, which can either be expressed as a path on the file system (File System Storage) or a table in a catalog such as Hive (Catalog Storage).</p>
+<div class="source">
+<pre>
+    &lt;xs:choice minOccurs=&quot;1&quot; maxOccurs=&quot;1&quot;&gt;
+        &lt;xs:element type=&quot;locations&quot; name=&quot;locations&quot;/&gt;
+        &lt;xs:element type=&quot;catalog-table&quot; name=&quot;table&quot;/&gt;
+    &lt;/xs:choice&gt;
+
+</pre></div>
+<p>A feed should contain one of the two storage options: locations on a File System or a table in a Catalog.</p></div>
+<div class="section">
+<h5>File System Storage<a name="File_System_Storage"></a></h5>
+<p>This is expressed as a location on the file system. Location specifies where the feed is available on this cluster. A location tag specifies the type of location like data, meta, stats and the corresponding paths for them. A feed should at least define the location for type data, which specifies the HDFS path pattern where the feed is generated periodically. ex: type=&quot;data&quot; path=&quot;/projects/TrafficHourly/${YEAR}-${MONTH}-${DAY}/traffic&quot; The granularity of date pattern in the path should be at least that of a frequency of a feed.</p>
+<div class="source">
+<pre>
+ &lt;location type=&quot;data&quot; path=&quot;/projects/falcon/clicks&quot; /&gt;
+ &lt;location type=&quot;stats&quot; path=&quot;/projects/falcon/clicksStats&quot; /&gt;
+ &lt;location type=&quot;meta&quot; path=&quot;/projects/falcon/clicksMetaData&quot; /&gt;
+
+</pre></div></div>
+<div class="section">
+<h5>Catalog Storage (Table)<a name="Catalog_Storage_Table"></a></h5>
+<p>A table tag specifies the table URI in the catalog registry as:</p>
+<div class="source">
+<pre>
+catalog:$database-name:$table-name#partition-key=partition-value);partition-key=partition-value);*
+
+</pre></div>
+<p>This is modeled as a URI (similar to an ISBN URI). It does not have any reference to Hive or HCatalog. It's quite generic so it can be tied to other implementations of a catalog registry. The catalog implementation specified in the startup config provides the implementation for the catalog URI.</p>
+<p>Top-level partition has to be a dated pattern and the granularity of date pattern should be at least that of a frequency of a feed.</p>
+<p>Examples:</p>
+<div class="source">
+<pre>
+&lt;table uri=&quot;catalog:default:clicks#ds=${YEAR}-${MONTH}-${DAY}-${HOUR};region=${region}&quot; /&gt;
+&lt;table uri=&quot;catalog:src_demo_db:customer_raw#ds=${YEAR}-${MONTH}-${DAY}-${HOUR}&quot; /&gt;
+&lt;table uri=&quot;catalog:tgt_demo_db:customer_bcp#ds=${YEAR}-${MONTH}-${DAY}-${HOUR}&quot; /&gt;
+
+</pre></div></div>
+<div class="section">
+<h3>Entity Management actions<a name="Entity_Management_actions"></a></h3>
+<p>All the following operations can also be done using <a href="./Restapi/ResourceList.html">Falcon's RESTful API</a>.</p></div>
 <div class="section">
 <h4>Submit<a name="Submit"></a></h4>
 <p>Entity submit action allows a new cluster/feed/process to be setup within Falcon. Submitted entity is not scheduled, meaning it would simply be in the configuration store within Falcon. Besides validating against the schema for the corresponding entity being added, the Falcon system would also perform inter-field validations within the configuration file and validations across dependent entities.</p></div>
@@ -319,7 +364,8 @@
 <p>Returns the dependencies of the requested entity. The dependency list includes both forward and backward dependencies (depends on &amp; is dependent on). For example, a feed would show the processes that are dependent on the feed and the clusters that it depends on.</p></div>
 <div class="section">
 <h4>Schedule<a name="Schedule"></a></h4>
-<p>Feeds or Processes that are already submitted and present in the config store can be scheduled. Upon schedule, Falcon system wraps the required repeatable action as a bundle of oozie coordinators and executes them on the Oozie scheduler. (It is possible to extend Falcon to use an alternate workflow engine other than Oozie). Falcon overrides the workflow instance's external id in Oozie to reflect the process/feed and the nominal time. This external Id can then be used for instance management functions.</p></div>
+<p>Feeds or Processes that are already submitted and present in the config store can be scheduled. Upon schedule, Falcon system wraps the required repeatable action as a bundle of oozie coordinators and executes them on the Oozie scheduler. (It is possible to extend Falcon to use an alternate workflow engine other than Oozie). Falcon overrides the workflow instance's external id in Oozie to reflect the process/feed and the nominal time. This external Id can then be used for instance management functions.</p>
+<p>The schedule copies the user specified workflow and library to a staging path, and the scheduler references the workflow and lib from the staging path.</p></div>
 <div class="section">
 <h4>Suspend<a name="Suspend"></a></h4>
 <p>This action is applicable only on scheduled entity. This triggers suspend on the oozie bundle that was scheduled earlier through the schedule function. No further instances are executed on a suspended process/feed.</p></div>
@@ -337,13 +383,13 @@
 <p>Delete operation on the entity removes any scheduled activity on the workflow engine, besides removing the entity from the falcon configuration store. Delete operation on an entity would only succeed if there are no dependent entities on the deleted entity.</p></div>
 <div class="section">
 <h4>Update<a name="Update"></a></h4>
-<p>Update operation allows an already submitted/scheduled entity to be updated. Cluster update is currently not allowed. Feed update can cause cascading update to all the processes already scheduled. The following set of actions are performed in Oozie to realize an update.</p>
-<p></p>
+<p>Update operation allows an already submitted/scheduled entity to be updated. Cluster update is currently not allowed. Feed update can cause a cascading update to all the processes already scheduled. A process update triggers an update in falcon if the entity is updated or the user specified workflow/lib is updated. The following set of actions are performed in Oozie to realize an update:</p>
 <ul>
-<li>Suspend the previously scheduled Oozie coordinator. This is prevent any new action from being triggered.</li>
+<li>Suspend the previously scheduled Oozie coordinator. This is to prevent any new action from being triggered.</li>
 <li>Update the coordinator to set the end time to &quot;now&quot;</li>
 <li>Resume the suspended coordinators</li>
-<li>Schedule as per the new process/feed definition with the start time as &quot;now&quot;</li></ul></div>
+<li>Schedule as per the new process/feed definition with the start time as &quot;now&quot;</li></ul>
+<p>Update optionally takes effective time as a parameter, which is used as the end time of the previously scheduled coordinator, so the updated configuration will be effective from the given timestamp.</p></div>
 <div class="section">
 <h3>Instance Management actions<a name="Instance_Management_actions"></a></h3>
 <p>Instance Manager gives user the option to control individual instances of the process based on their instance start time (start time of that instance). Start time needs to be given in standard TZ format. Example: 01 Jan 2012 01:00 =&gt; 2012-01-01T01:00Z</p>
@@ -368,10 +414,13 @@
 <ul>
 <li>5.	<b>resume</b>: -resume option is used to resume any instance that is in suspended state. (Note: due to a bug in oozie the -resume option in some cases may not actually resume the suspended instance/ instances)</li>
 <li>6. <b>kill</b>: -kill option can be used to kill an instance or multiple instances</li></ul>
+<p></p>
+<ul>
+<li>7. <b>summary</b>: -summary option via CLI can be used to get the consolidated status of the instances between the specified time period. Each status along with the corresponding instance count are listed for each of the applicable colos.</li></ul>
 <p>In all the cases where your request is syntactically correct but logically not, the instance / instances are returned with the same status as earlier. Example: trying to resume a KILLED / SUCCEEDED instance will return the instance with KILLED / SUCCEEDED, without actually performing any operation. This is so because only an instance in SUSPENDED state can be resumed. Same thing is valid for rerun a SUSPENDED or RUNNING options etc.</p></div>
 <div class="section">
 <h3>Retention<a name="Retention"></a></h3>
-<p>In coherence with it's feed lifecycle management philosophy, Falcon allows the user to retain data in the system for a specific period of time for a scheduled feed. The user can specify the retention period in the respective  feed/data xml in the following manner for each cluster the feed can belong to :</p>
+<p>In coherence with its feed lifecycle management philosophy, Falcon allows the user to retain data in the system for a specific period of time for a scheduled feed. The user can specify the retention period in the respective feed/data xml in the following manner for each cluster the feed can belong to:</p>
 <div class="source">
 <pre>
 &lt;clusters&gt;
@@ -383,7 +432,8 @@
  &lt;/clusters&gt; 
 
 </pre></div>
-<p>The 'limit' attribute can be specified in units of minutes/hours/days/months, and a corresponding numeric value can be attached to it. It essentially instructs the system to retain data spanning from the current moment to the time specified in the attribute spanning backwards in time. Any data beyond the limit (past/future) is erased from the system.</p></div>
+<p>The 'limit' attribute can be specified in units of minutes/hours/days/months, and a corresponding numeric value can be attached to it. It essentially instructs the system to retain data spanning from the current moment to the time specified in the attribute spanning backwards in time. Any data beyond the limit (past/future) is erased from the system.</p>
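+<p>For example, a retention declaration keeping the most recent 10 hours of data (as in the example below) might look like this sketch; the action value is illustrative:</p>
+<div class="source">
+<pre>
+    &lt;retention limit=&quot;hours(10)&quot; action=&quot;delete&quot;/&gt;
+
+</pre></div>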
+<p>With the integration of Hive, Falcon also provides retention for tables in Hive catalog.</p></div>
 <div class="section">
 <h4>Example:<a name="Example:"></a></h4>
 <p>If the retention period is 10 hours, and the policy kicks in at time 't', the data retained by the system is essentially the one falling in between [t-10h, t]. Any data in the boundaries [-&#8734;, t-10h) and (t, &#8734;] is removed from the system.</p>
@@ -423,7 +473,7 @@
 <p>Replication can be scheduled with the past date, the time frame considered for replication is the minimum overlapping window of start and end time of source and target cluster, ex: if s1 and e1 is the start and end time of source cluster respectively, and s2 and e2 of target cluster, then the coordinator is scheduled in target cluster with start time max(s1,s2) and min(e1,e2).</p>
 <p>A feed can also optionally specify the delay for replication instance in the cluster tag, the delay governs the replication instance delays. If the frequency of the feed is hours(2) and delay is hours(1), then the replication instance will run every 2 hours and replicates data with an offset of 1 hour, i.e. at 09:00 UTC, feed instance which is eligible for replication is 08:00; and 11:00 UTC, feed instance of 10:00 UTC is eligible and so on.</p></div>
 <div class="section">
-<h4>Where is the feed path defined?<a name="Where_is_the_feed_path_defined"></a></h4>
+<h4>Where is the feed path defined for File System Storage?<a name="Where_is_the_feed_path_defined_for_File_System_Storage"></a></h4>
 <p>It's defined in the feed xml within the location tag.</p>
 <p><b>Example:</b></p>
 <div class="source">
@@ -452,6 +502,18 @@
 
 </pre></div></div>
 <div class="section">
+<h4>Hive Table Replication<a name="Hive_Table_Replication"></a></h4>
+<p>With the integration of Hive, Falcon adds table replication of Hive catalog tables. Replication will be triggered for a partition when the partition is complete at the source.</p>
+<p></p>
+<ul>
+<li>Falcon will use HCatalog (Hive) API to export the data for a given table and the partition, which will result in a data collection that includes metadata on the data's storage format, the schema, how the data is sorted, what table the data came from, and values of any partition keys from that table.</li>
+<li>Falcon will use the distcp tool to copy the exported data collection into a staging directory used by Falcon on the secondary cluster.</li>
+<li>Falcon will then import the data into HCatalog (Hive) using the HCatalog (Hive) API. If the specified table does not yet exist, Falcon will create it, using the information in the imported metadata to set defaults for the table such as schema, storage format, etc.</li>
+<li>The partition is not complete and hence not visible to users until all the data is committed on the secondary cluster (no dirty reads).</li></ul></div>
+<div class="section">
 <h4>Relation between feed's retention limit and feed's late arrival cut off period:<a name="Relation_between_feeds_retention_limit_and_feeds_late_arrival_cut_off_period:"></a></h4>
 <p>For reasons that are obvious, Falcon has an external validation that ensures that the user always specifies the feed retention limit to be more than the feed's allowed late arrival period. If this rule is violated by the user, the feed submission call itself throws back an error.</p></div>
 <div class="section">
@@ -531,17 +593,17 @@ feed=&quot;raaw-logs16&quot; name=&quot;
 <p><b>Feed xml:</b></p>
 <div class="source">
 <pre>
-&lt;feed description=&quot;clicks log&quot; name=&quot;raaw-logs16&quot;....
+&lt;feed description=&quot;clicks log&quot; name=&quot;raw-logs16&quot;....
 
 </pre></div>
-<p>* The time interpretation for corresponding tags indicating the start and end instances for a particular input feed in the process xml should lie well within the timespan of the period specified in &lt;validity&gt; tag of the particular feed.</p>
+<p>* The time interpretation for corresponding tags indicating the start and end instances for a particular input feed in the process xml should lie well within the time span of the period specified in &lt;validity&gt; tag of the particular feed.</p>
 <p><b>Example:</b></p>
 <p>1. In the following scenario, process submission will result in an error:</p>
 <p><b>Process XML:</b></p>
 <div class="source">
 <pre>
 &lt;input end-instance=&quot;now(0,20)&quot; start-instance=&quot;now(0,-60)&quot;
-   feed=&quot;raaw-logs16&quot; name=&quot;inputData&quot;/&gt;
+   feed=&quot;raw-logs16&quot; name=&quot;inputData&quot;/&gt;
 
 </pre></div>
 <p><b>Feed XML:</b></p>
@@ -573,7 +635,7 @@ validity start=&quot;2009-01-01T00:00Z&q
 <p>Any changes in feed/process can be done by updating its definition. After the update, any new workflows which are to be scheduled after the update call will pick up the new changes. Feed/process name and start time can't be updated. Updating a process triggers updates to the workflow that is triggered in the workflow engine. Updating feed updates feed workflows like retention, replication etc. and also updates the processes that reference the feed.</p></div>
 <div class="section">
 <h3>Handling late input data<a name="Handling_late_input_data"></a></h3>
-<p>Falcon system can handle late arrival of input data and appropriately re-trigger processing for the affected instance. From the perspective of late handling, there are two main configuration parameters late-arrival cut-off and late-inputs section in feed and process entity definition that are central. These configurations govern how and when the late processing happens. In the current implementation (oozie based) the late handling is very simple and basic. The falcon system looks at all dependent input feeds for a process and computes the max late cut-off period. Then it uses a scheduled messaging framework, like the one available in Apache ActiveMQ or Java's <a href="./DelayQueue.html">DelayQueue</a> to schedule a message with a cut-off period, then after a cut-off period the message is dequeued and Falcon checks for changes in the feed data which is recorded in HDFS in latedata file by falcons &quot;record-size&quot; action, if it detects any changes then the workflow will be rerun with the new set of feed data.</p>
+<p>Falcon system can handle late arrival of input data and appropriately re-trigger processing for the affected instance. From the perspective of late handling, there are two main configuration parameters late-arrival cut-off and late-inputs section in feed and process entity definition that are central. These configurations govern how and when the late processing happens. In the current implementation (oozie based) the late handling is very simple and basic. The falcon system looks at all dependent input feeds for a process and computes the max late cut-off period. Then it uses a scheduled messaging framework, like the one available in Apache ActiveMQ or Java's DelayQueue to schedule a message with a cut-off period, then after a cut-off period the message is dequeued and Falcon checks for changes in the feed data which is recorded in HDFS in latedata file by falcons &quot;record-size&quot; action, if it detects any changes then the workflow will be rerun with the new set of feed data.</p>
 <p><b>Example:</b> The late rerun policy can be configured in the process definition. Falcon supports 3 policies, periodic, exp-backoff and final. Delay specifies, how often the feed data should be checked for changes, also one needs to  explicitly set the feed names in late-input which needs to be checked for late data.</p>
 <div class="source">
 <pre>
@@ -582,7 +644,8 @@ validity start=&quot;2009-01-01T00:00Z&q
         &lt;late-input input=&quot;clicks&quot; workflow-path=&quot;hdfs://clicks/late/workflow&quot; /&gt;
    &lt;/late-process&gt;
 
-</pre></div></div>
+</pre></div>
+<p><b>NOTE:</b> Feeds configured with table storage does not support late input data handling at this point. This will be made available in the near future.</p></div>
 <div class="section">
 <h3>Idempotency<a name="Idempotency"></a></h3>
 <p>All the operations in Falcon are Idempotent. That is, if you make the same request to the falcon server / prism again you will get a SUCCESSFUL return if it was SUCCESSFUL in the first attempt. For example, you submit a new process / feed and get a SUCCESSFUL message in return. Now if you run the same command / api request on the same entity you will again get a SUCCESSFUL message. The same is true for other operations like schedule, kill, suspend and resume. Idempotency also takes care of the condition when a request is sent through prism and fails on one or more servers. For example, prism is configured to send requests to 3 servers. First the user sends a request to SUBMIT a process on all 3 of them, and receives a SUCCESSFUL response from all of them. Then due to some issue one of the servers goes down, and the user sends a request to schedule the submitted process. This time he will receive a response with PARTIAL status and a FAILURE message from the server that has gone down. If the user checks, he will find the process would have been started and running on the 2 SUCCESSFUL servers. Now the issue with the server is figured out and it is brought up. Sending the SCHEDULE request again through prism will result in a SUCCESSFUL response from prism as well as the other three servers, but this time the PROCESS will be SCHEDULED only on the server which had failed earlier and the other two will keep running as before.</p></div>
@@ -628,15 +691,15 @@ validity start=&quot;2009-01-01T00:00Z&q
 <li>Action - Name of the event.</li>
 <li>Dimensions - A list of name/value pairs of various attributes for a given action.</li>
 <li>Status- Status of an action FAILED/SUCCEEDED.</li>
-<li>Time-taken - Time taken in nano seconds for a given action.</li></ol>
+<li>Time-taken - Time taken in nanoseconds for a given action.</li></ol>
 <p>An example for an event logged for a submit of a new process definition:</p>
 <p>2012-05-04 12:23:34,026 {Action:submit, Dimensions:{entityType=process}, Status: SUCCEEDED, Time-taken:97087000 ns}</p>
 <p>Users may parse the metric.log or capture these events from custom monitoring frameworks and can plot various graphs  or send alerts according to their requirements.</p></div>
 <div class="section">
 <h4>Notifications<a name="Notifications"></a></h4>
 <p>Falcon creates a JMS topic for every process/feed that is scheduled in Falcon. The implementation class and the broker url of the JMS engine are read from the dependent cluster's definition. Users may register consumers on the required topic to check the availability or status of feed instances.</p>
-<p>For a given process that is scheduled, the name of the topic is same as the process name. Falcon sends a Map message for every feed produced by the instance of a process to the JMS topic. The JMS <a href="./MapMessage.html">MapMessage</a> sent to a topic has the following properties: entityName, feedNames, feedInstancePath, workflowId, runId, nominalTime, timeStamp, brokerUrl, brokerImplClass, entityType, operation, logFile, topicName, status, brokerTTL;</p>
-<p>For a given feed that is scheduled, the name of the topic is same as the feed name. Falcon sends a map message for every feed instance that is deleted/archived/replicated depending upon the retention policy set in the feed definition. The JMS <a href="./MapMessage.html">MapMessage</a> sent to a topic has the following properties: entityName, feedNames, feedInstancePath, workflowId, runId, nominalTime, timeStamp, brokerUrl, brokerImplClass, entityType, operation, logFile, topicName, status, brokerTTL;</p>
+<p>For a given process that is scheduled, the name of the topic is the same as the process name. Falcon sends a map message for every feed produced by an instance of a process to the JMS topic. The JMS MapMessage sent to a topic has the following properties: entityName, feedNames, feedInstancePath, workflowId, runId, nominalTime, timeStamp, brokerUrl, brokerImplClass, entityType, operation, logFile, topicName, status, brokerTTL.</p>
+<p>For a given feed that is scheduled, the name of the topic is the same as the feed name. Falcon sends a map message for every feed instance that is deleted/archived/replicated depending upon the retention policy set in the feed definition. The JMS MapMessage sent to a topic has the following properties: entityName, feedNames, feedInstancePath, workflowId, runId, nominalTime, timeStamp, brokerUrl, brokerImplClass, entityType, operation, logFile, topicName, status, brokerTTL.</p>
 <p>The JMS messages are automatically purged after a certain period (default 3 days) by the Falcon JMS house-keeping service. The TTL (time-to-live) for JMS messages can be configured in Falcon's startup.properties file.</p></div>
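 <p>A sketch of the relevant startup property (the property name and default shown here are an assumption and should be verified against the startup.properties shipped with your release):</p>
 <div class="source">
 <pre>
 # JMS message time-to-live in minutes (default corresponds to 3 days) -- assumed property name
 *.broker.ttlInMins=4320
 
 </pre></div>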
 <div class="section">
 <h3>Falcon EL Expressions<a name="Falcon_EL_Expressions"></a></h3>
@@ -690,19 +753,61 @@ validity start=&quot;2009-01-01T00:00Z&q
 <li>3.	<b>yesterday(hours,minutes)</b>: As the name suggests, the yesterday EL picks up feed instances with respect to the start of the day yesterday. Hours and minutes are added to the 00 hours starting yesterday. Example: yesterday(24,30) will actually correspond to 00:30 am of today; for 2010-01-02T01:30Z this would mean the 2010-01-02:00:30 feed.</li></ul>
 <p></p>
 <ul>
 <li>4.	<b>currentMonth(day,hour,minute)</b>: Current month takes the reference to start of the month with respect to instance start time. One thing to keep in mind is that day is added to the first day of the month. So the value of day is the number of days you want to add to the first day of the month. For example: for instance start time 2010-01-12T01:30Z and El as currentMonth(3,2,40) will correspond to feed created at 2010-01-04T02:40Z and currentMonth(0,0,0) will mean 2010-01-01T00:00Z.</li></ul>
 <p></p>
 <ul>
 <li>5.	<b>lastMonth(day,hour,minute)</b>: Parameters for lastMonth are the same as for currentMonth, the only difference being that the reference is shifted one month back. For instance start 2010-01-12T01:30Z, lastMonth(2,3,30) will correspond to the feed instance at 2009-12-03T03:30Z.</li></ul>
 <p></p>
 <ul>
-<li>6.	<b>currentYear(month,day,hour,minute)</b>: The month,day,hour, minutes in the pareamter are added with reference to the start of year of instance start time. For our exmple start time 2010-01-02:00:30 reference will go back to 2010-01-01:T00:00Z. Also similar to days, months are added to the 1st month that Jan. So currentYear(0,2,2,20) will mean 2010-01-03T02:20Z while currentYear(11,2,2,20) will mean 2010-12-03T02:20Z</li></ul>
+<li>6.	<b>currentYear(month,day,hour,minute)</b>: The month, day, hour and minutes in the parameter are added with reference to the start of the year of the instance start time. For our example start time 2010-01-02T00:30Z, the reference will go back to 2010-01-01T00:00Z. Also, similar to days, months are added to the 1st month, that is Jan. So currentYear(0,2,2,20) will mean 2010-01-03T02:20Z while currentYear(11,2,2,20) will mean 2010-12-03T02:20Z.</li></ul>
+<p></p>
+<ul>
+<li>7.	<b>lastYear(month,day,hour,minute)</b>: This is exactly similar to currentYear in usage, the only difference being that the start reference is taken to the start of the previous year. For example: lastYear(4,2,2,20) will correspond to the feed instance created at 2009-05-03T02:20Z and lastYear(12,2,2,20) will correspond to the feed at 2010-01-03T02:20Z.</li></ul>
+<p></p>
+<ul>
+<li>8. <b>latest(number of latest instance)</b>: This will simply make your input consider the given number of latest available instances of the feed. For example: latest(0) will consider the last available instance of the feed, whereas latest(-1) will consider the second last available instance and latest(-3) will consider the 4th last available instance.</li></ul>
 <p></p>
 <ul>
-<li>7.	<b>lastYear(month,day,hour,minute)</b>: This is exactly similary to currentYear in usage&gt; only difference being start reference is taken to start of previous year. For example: lastYear(4,2,2,20) will corrospond to feed insatnce created at 2009-05-03T02:20Z and lastYear(12,2,2,20) will corrospond to feed at 2010-01-03T02:20Z.</li></ul>
+<li>9.	<b>currentWeek(weekDayName,hour,minute)</b>: This is similar to currentMonth in the sense that it returns a relative time with respect to the instance start time, considering the day name provided as input as the start of the week. The day names can be one of SUN, MON, TUE, WED, THU, FRI, SAT.</li></ul>
 <p></p>
 <ul>
-<li>8. <b>latest(number of latest instance)</b>: This will simply make you input consider the number of latest available instance of the feed given as parameter. For example: latest(0) will consider the last available instance of feed, where as latest latest(-1) will consider second last available feed and latest(-3) will consider 4th last available feed.</li></ul></div>
+<li>10. <b>lastWeek(weekDayName,hour,minute)</b>: This is typically 7 days less than what the currentWeek returns for similar parameters.</li></ul></div>
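+<p>To illustrate where these expressions are used, a hedged sketch of process inputs/outputs follows; the feed names are hypothetical and the element names follow the process specification above:</p>
+<div class="source">
+<pre>
+&lt;inputs&gt;
+    &lt;input name=&quot;clicks&quot; feed=&quot;clicks-feed&quot; start=&quot;yesterday(0,0)&quot; end=&quot;yesterday(20,0)&quot;/&gt;
+&lt;/inputs&gt;
+&lt;outputs&gt;
+    &lt;output name=&quot;clicksSummary&quot; feed=&quot;clicks-summary&quot; instance=&quot;today(0,0)&quot;/&gt;
+&lt;/outputs&gt;
+
+</pre></div>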
+<div class="section">
+<h3>Lineage<a name="Lineage"></a></h3>
+<p>Falcon adds the ability to capture lineage for both entities and its associated instances. It also captures the metadata tags associated with each of the entities as relationships. The following relationships are captured:</p>
+<p></p>
+<ul>
+<li>owner of entities - User</li>
+<li>data classification tags</li>
+<li>groups defined in feeds</li>
+<li>Relationships between entities
+<ul>
+<li>Clusters associated with Feed and Process entity</li>
+<li>Input and Output feeds for a Process</li></ul></li>
+<li>Instances refer to corresponding entities</li></ul>
+<p>Lineage is exposed in 3 ways:</p>
+<p></p>
+<ul>
+<li>REST API</li>
+<li>CLI</li>
+<li>Dashboard - Interactive lineage for Process instances</li></ul>
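+<p>For example, the CLI exposure corresponds to the graph sub-commands described in the FalconCLI documentation; the key/value pair below is illustrative:</p>
+<div class="source">
+<pre>
+$FALCON_HOME/bin/falcon graph -vertices -key type -value feed-instance
+
+</pre></div>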
+<p>This feature is enabled by default but can be disabled by removing the following from startup.properties:</p>
+<div class="source">
+<pre>
+config name: *.application.services
+config value: org.apache.falcon.metadata.MetadataMappingService
+&lt;verbatim&gt;
+
+Lineage is only captured for Process executions. A future release will capture lineage for
+lifecycle policies such as replication and retention.
+
+--++ Security
+
+Security is detailed in [[Security][Security]].
+</pre></div></div>
                   </div>
           </div>
 

Modified: incubator/falcon/site/docs/FalconCLI.html
URL: http://svn.apache.org/viewvc/incubator/falcon/site/docs/FalconCLI.html?rev=1624488&r1=1624487&r2=1624488&view=diff
==============================================================================
--- incubator/falcon/site/docs/FalconCLI.html (original)
+++ incubator/falcon/site/docs/FalconCLI.html Fri Sep 12 09:43:48 2014
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia at 2014-07-05
+ | Generated by Apache Maven Doxia at 2014-09-12
  | Rendered using Apache Maven Fluido Skin 1.3.0
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20140705" />
+    <meta name="Date-Revision-yyyymmdd" content="20140912" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Falcon - FalconCLI</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
@@ -239,7 +239,7 @@
         
                 
                     
-                  <li id="publishDate" class="pull-right">Last Published: 2014-07-05</li> 
+                  <li id="publishDate" class="pull-right">Last Published: 2014-09-12</li> 
             
                             </ul>
       </div>
@@ -265,7 +265,7 @@
 <p>Example: $FALCON_HOME/bin/falcon entity  -type process -name sampleProcess -schedule</p></div>
 <div class="section">
 <h4>Suspend<a name="Suspend"></a></h4>
-<p>Suspend on an entity results in suspension of the oozie bundle that was scheduled earlier through the schedule function. No further instances are executed on a suspended entity. Only schedulable entities(process/feed) can be suspended.</p>
+<p>Suspend on an entity results in suspension of the oozie bundle that was scheduled earlier through the schedule function. No further instances are executed on a suspended entity. Only schedule-able entities (process/feed) can be suspended.</p>
 <p>Usage: $FALCON_HOME/bin/falcon entity  -type [feed|process] -name &lt;&lt;name&gt;&gt; -suspend</p></div>
 <div class="section">
 <h4>Resume<a name="Resume"></a></h4>
@@ -278,7 +278,15 @@
 <div class="section">
 <h4>List<a name="List"></a></h4>
 <p>Entities of a particular type can be listed with list sub-command.</p>
-<p>Usage: $FALCON_HOME/bin/falcon entity -type [cluster|feed|process] -list</p></div>
+<p>Usage: $FALCON_HOME/bin/falcon entity -type [cluster|feed|process] -list</p>
+<p>Optional Args : -fields &lt;&lt;field1,field2&gt;&gt; -filterBy &lt;&lt;field1:value1,field2:value2&gt;&gt; -tags &lt;&lt;tagkey=tagvalue,tagkey=tagvalue&gt;&gt; -orderBy &lt;&lt;field&gt;&gt; -sortOrder &lt;&lt;sortOrder&gt;&gt; -offset 0 -numResults 10</p>
+<p><a href="./Restapi/EntityList.html">Optional params described here.</a></p></div>
+<div class="section">
+<h4>Summary<a name="Summary"></a></h4>
+<p>Lists a summary of entities of a particular type for a cluster. The entity summary includes the N most recent instances of each entity.</p>
+<p>Usage: $FALCON_HOME/bin/falcon entity -type [cluster|feed|process] -summary</p>
+<p>Optional Args : -start &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -end &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -fields &lt;&lt;field1,field2&gt;&gt; -filterBy &lt;&lt;field1:value1,field2:value2&gt;&gt; -tags &lt;&lt;tagkey=tagvalue,tagkey=tagvalue&gt;&gt; -orderBy &lt;&lt;field&gt;&gt; -sortOrder &lt;&lt;sortOrder&gt;&gt; -offset 0 -numResults 10 -numInstances 7</p>
+<p><a href="./Restapi/EntitySummary.html">Optional params described here.</a></p></div>
 <div class="section">
 <h4>Update<a name="Update"></a></h4>
 <p>Update operation allows an already submitted/scheduled entity to be updated. Cluster update is currently not allowed.</p>
@@ -325,29 +333,78 @@
 <p>The status option via CLI can be used to get the status of a single or multiple instances. If the instance is not yet materialized but is within the process validity range, WAITING is returned as the state. Along with the status, the instance time is also returned. The log location gives the oozie workflow url. If the instance is in WAITING state, missing dependencies are listed.</p>
 <p>Example : Suppose a process has 3 instances, one has succeeded, one is in running state and the other one is waiting; the expected output is:</p>
 <p>{&quot;status&quot;:&quot;SUCCEEDED&quot;,&quot;message&quot;:&quot;getStatus is successful&quot;,&quot;instances&quot;:[{&quot;instance&quot;:&quot;2012-05-07T05:02Z&quot;,&quot;status&quot;:&quot;SUCCEEDED&quot;,&quot;logFile&quot;:&quot;http://oozie-dashboard-url&quot;},{&quot;instance&quot;:&quot;2012-05-07T05:07Z&quot;,&quot;status&quot;:&quot;RUNNING&quot;,&quot;logFile&quot;:&quot;http://oozie-dashboard-url&quot;}, {&quot;instance&quot;:&quot;2010-01-02T11:05Z&quot;,&quot;status&quot;:&quot;WAITING&quot;}]</p>
-<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -status -start &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -end &quot;yyyy-MM-dd'T'HH:mm'Z'&quot;</p></div>
+<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -status</p>
+<p>Optional Args : -start &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -end &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -colo &lt;&lt;colo&gt;&gt; -filterBy &lt;&lt;field1:value1,field2:value2&gt;&gt; -lifecycle &lt;&lt;lifecycles&gt;&gt; -orderBy field -sortOrder &lt;&lt;sortOrder&gt;&gt; -offset 0 -numResults 10</p>
+<p><a href="./Restapi/InstanceStatus.html"> Optional params described here.</a></p></div>
+<div class="section">
+<h4>List<a name="List"></a></h4>
+<p>The list option via CLI can be used to get a single or multiple instances. If the instance is not yet materialized but is within the process validity range, WAITING is returned as the state. The instance time is also returned. The log location gives the oozie workflow url. If the instance is in WAITING state, missing dependencies are listed.</p>
+<p>Example : Suppose a process has 3 instances, one has succeeded, one is in running state and the other one is waiting; the expected output is:</p>
+<p>{&quot;status&quot;:&quot;SUCCEEDED&quot;,&quot;message&quot;:&quot;getStatus is successful&quot;,&quot;instances&quot;:[{&quot;instance&quot;:&quot;2012-05-07T05:02Z&quot;,&quot;status&quot;:&quot;SUCCEEDED&quot;,&quot;logFile&quot;:&quot;http://oozie-dashboard-url&quot;},{&quot;instance&quot;:&quot;2012-05-07T05:07Z&quot;,&quot;status&quot;:&quot;RUNNING&quot;,&quot;logFile&quot;:&quot;http://oozie-dashboard-url&quot;}, {&quot;instance&quot;:&quot;2010-01-02T11:05Z&quot;,&quot;status&quot;:&quot;WAITING&quot;}]</p>
+<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -list</p>
+<p>Optional Args : -start &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -end &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -colo &lt;&lt;colo&gt;&gt; -lifecycle &lt;&lt;lifecycles&gt;&gt; -filterBy &lt;&lt;field1:value1,field2:value2&gt;&gt; -orderBy field -sortOrder &lt;&lt;sortOrder&gt;&gt; -offset 0 -numResults 10</p>
+<p><a href="./Restapi/InstanceList.html">Optional params described here.</a></p></div>
 <div class="section">
 <h4>Summary<a name="Summary"></a></h4>
 <p>The summary option via CLI can be used to get the consolidated status of the instances within the specified time period. Each status, along with the corresponding instance count, is listed for each of the applicable colos. The unscheduled instances within the specified time period are included as UNSCHEDULED in the output to provide more clarity.</p>
 <p>Example : Suppose a process has 3 instances, one has succeeded, one is in running state and the other one is waiting; the expected output is:</p>
 <p>{&quot;status&quot;:&quot;SUCCEEDED&quot;,&quot;message&quot;:&quot;getSummary is successful&quot;, &quot;cluster&quot;: &lt;&lt;name&gt;&gt; [{&quot;SUCCEEDED&quot;:&quot;1&quot;}, {&quot;WAITING&quot;:&quot;1&quot;}, {&quot;RUNNING&quot;:&quot;1&quot;}]}</p>
-<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -summary -start &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -end &quot;yyyy-MM-dd'T'HH:mm'Z'&quot;</p></div>
+<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -summary</p>
+<p>Optional Args : -start &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -end &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -colo &lt;&lt;colo&gt;&gt; -lifecycle &lt;&lt;lifecycles&gt;&gt;</p>
+<p><a href="./Restapi/InstanceSummary.html">Optional params described here.</a></p></div>
 <div class="section">
 <h4>Running<a name="Running"></a></h4>
 <p>Running option provides all the running instances of the mentioned process.</p>
-<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -running</p></div>
+<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -running</p>
+<p>Optional Args : -colo &lt;&lt;colo&gt;&gt; -lifecycle &lt;&lt;lifecycles&gt;&gt; -filterBy &lt;&lt;field1:value1,field2:value2&gt;&gt; -orderBy &lt;&lt;field&gt;&gt; -sortOrder &lt;&lt;sortOrder&gt;&gt; -offset 0 -numResults 10</p>
+<p><a href="./Restapi/InstanceRunning.html">Optional params described here.</a></p></div>
 <div class="section">
 <h4>Logs<a name="Logs"></a></h4>
 <p>Get logs for instance actions</p>
-<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -logs -start &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; [-end &quot;yyyy-MM-dd'T'HH:mm'Z'&quot;] [-runid &lt;&lt;runid&gt;&gt;]</p></div>
+<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -logs</p>
+<p>Optional Args : -start &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -end &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -runid &lt;&lt;runid&gt;&gt; -colo &lt;&lt;colo&gt;&gt; -lifecycle &lt;&lt;lifecycles&gt;&gt; -filterBy &lt;&lt;field1:value1,field2:value2&gt;&gt; -orderBy field -sortOrder &lt;&lt;sortOrder&gt;&gt; -offset 0 -numResults 10</p>
+<p><a href="./Restapi/InstanceLogs.html">Optional params described here.</a></p></div>
+<div class="section">
+<h4>LifeCycle<a name="LifeCycle"></a></h4>
+<p>Describes the list of life cycles of an entity: for a feed it can be replication/retention and for a process it can be execution. This can be used with the instance management options. Default values are replication for feed and execution for process.</p>
+<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -status -lifecycle &lt;&lt;lifecycletype&gt;&gt; -start &quot;yyyy-MM-dd'T'HH:mm'Z'&quot; -end &quot;yyyy-MM-dd'T'HH:mm'Z'&quot;</p></div>
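+<p>For example, to check only the replication instances of a feed (the feed name and time window are placeholders):</p>
+<div class="source">
+<pre>
+$FALCON_HOME/bin/falcon instance -type feed -name sampleFeed -status -lifecycle replication -start &quot;2014-09-01T00:00Z&quot; -end &quot;2014-09-02T00:00Z&quot;
+
+</pre></div>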
+<div class="section">
+<h4>Params<a name="Params"></a></h4>
+<p>Displays the workflow params of a given instance, where the start time is considered the nominal time of that instance.</p>
+<p>Usage: $FALCON_HOME/bin/falcon instance -type &lt;&lt;feed/process&gt;&gt; -name &lt;&lt;name&gt;&gt; -params -start &quot;yyyy-MM-dd'T'HH:mm'Z'&quot;</p></div>
+<div class="section">
+<h3>Graphs Options<a name="Graphs_Options"></a></h3></div>
+<div class="section">
+<h4>Vertex<a name="Vertex"></a></h4>
+<p>Get the vertex with the specified id.</p>
+<p>Usage: $FALCON_HOME/bin/falcon graph -vertex -id &lt;&lt;id&gt;&gt;</p>
+<p>Example: $FALCON_HOME/bin/falcon graph -vertex -id 4</p></div>
+<div class="section">
+<h4>Vertices<a name="Vertices"></a></h4>
+<p>Get all vertices for a key index given the specified value.</p>
+<p>Usage: $FALCON_HOME/bin/falcon graph -vertices -key &lt;&lt;key&gt;&gt; -value &lt;&lt;value&gt;&gt;</p>
+<p>Example: $FALCON_HOME/bin/falcon graph -vertices -key type -value feed-instance</p></div>
+<div class="section">
+<h4>Vertex Edges<a name="Vertex_Edges"></a></h4>
+<p>Get the adjacent vertices or edges of the vertex with the specified direction.</p>
+<p>Usage: $FALCON_HOME/bin/falcon graph -edges -id &lt;&lt;vertex-id&gt;&gt; -direction &lt;&lt;direction&gt;&gt;</p>
+<p>Example: $FALCON_HOME/bin/falcon graph -edges -id 4 -direction both $FALCON_HOME/bin/falcon graph -edges -id 4 -direction inE</p></div>
+<div class="section">
+<h4>Edge<a name="Edge"></a></h4>
+<p>Get the edge with the specified id.</p>
+<p>Usage: $FALCON_HOME/bin/falcon graph -edge -id &lt;&lt;id&gt;&gt;</p>
+<p>Example: $FALCON_HOME/bin/falcon graph -edge -id Q9n-Q-5g</p></div>
 <div class="section">
 <h3>Admin Options<a name="Admin_Options"></a></h3></div>
 <div class="section">
 <h4>Help<a name="Help"></a></h4>
-<p>Usage: $FALCON_HOME/bin/falcon admin -version</p></div>
+<p>Usage: $FALCON_HOME/bin/falcon admin -help</p></div>
 <div class="section">
 <h4>Version<a name="Version"></a></h4>
-<p>Version returns the current verion of Falcon installed. Usage: $FALCON_HOME/bin/falcon admin -help</p></div>
+<p>Version returns the current version of Falcon installed. Usage: $FALCON_HOME/bin/falcon admin -version</p></div>
+<div class="section">
+<h4>Status<a name="Status"></a></h4>
+<p>Status returns the current state of Falcon (running or stopped). Usage: $FALCON_HOME/bin/falcon admin -status</p></div>
                   </div>
           </div>
 

Modified: incubator/falcon/site/docs/HiveIntegration.html
URL: http://svn.apache.org/viewvc/incubator/falcon/site/docs/HiveIntegration.html?rev=1624488&r1=1624487&r2=1624488&view=diff
==============================================================================
--- incubator/falcon/site/docs/HiveIntegration.html (original)
+++ incubator/falcon/site/docs/HiveIntegration.html Fri Sep 12 09:43:48 2014
@@ -1,13 +1,13 @@
 <!DOCTYPE html>
 <!--
- | Generated by Apache Maven Doxia at 2014-07-05
+ | Generated by Apache Maven Doxia at 2014-09-12
  | Rendered using Apache Maven Fluido Skin 1.3.0
 -->
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <meta name="Date-Revision-yyyymmdd" content="20140705" />
+    <meta name="Date-Revision-yyyymmdd" content="20140912" />
     <meta http-equiv="Content-Language" content="en" />
     <title>Falcon - Hive Integration</title>
     <link rel="stylesheet" href="../css/apache-maven-fluido-1.3.0.min.css" />
@@ -239,7 +239,7 @@
         
                 
                     
-                  <li id="publishDate" class="pull-right">Last Published: 2014-07-05</li> 
+                  <li id="publishDate" class="pull-right">Last Published: 2014-09-12</li> 
             
                             </ul>
       </div>
@@ -316,7 +316,7 @@ catalog.service.impl=org.apache.falcon.c
 <ul>
 <li>Falcon will use HCatalog (Hive) API to export the data for a given table and the partition,</li></ul>which will result in a data collection that includes metadata on the data's storage format, the schema, how the data is sorted, what table the data came from, and values of any partition keys from that table.
 <ul>
-<li>Falcon will use <a href="./DistCp.html">DistCp</a> tool to copy the exported data collection into the secondary cluster into a staging</li></ul>directory used by Falcon.
+<li>Falcon will use the distcp tool to copy the exported data collection into the secondary cluster into a staging</li></ul>directory used by Falcon.
 <ul>
 <li>Falcon will then import the data into HCatalog (Hive) using the HCatalog (Hive) API. If the specified table does</li></ul>not yet exist, Falcon will create it, using the information in the imported metadata to set defaults for the table such as schema, storage format, etc.
 <ul>
@@ -350,11 +350,44 @@ catalog.service.impl=org.apache.falcon.c
 <pre>
 bin/hadoop dfs -copyFromLocal $LFS/share/lib/hcatalog/hcatalog-pig-adapter-0.5.0-incubating.jar share/lib/hcatalog
 
+</pre></div>
+<p></p>
+<ul>
+<li>Oozie 4.x with Hadoop-2.x</li></ul>Replication jobs are submitted to oozie on the destination cluster. Oozie runs a table export job on the RM of the source cluster. The oozie server on the target cluster must be configured with the source hadoop configs, else jobs fail on both secure and non-secure clusters with errors such as the one below:
+<div class="source">
+<pre>
+org.apache.hadoop.security.token.SecretManager$InvalidToken: Password not found for ApplicationAttempt appattempt_1395965672651_0010_000002
+
+</pre></div>
+<p>Make sure all oozie servers that falcon talks to have the hadoop configs configured in oozie-site.xml:</p>
+<div class="source">
+<pre>
+&lt;property&gt;
+      &lt;name&gt;oozie.service.HadoopAccessorService.hadoop.configurations&lt;/name&gt;
+      &lt;value&gt;*=/etc/hadoop/conf,arpit-new-falcon-1.cs1cloud.internal:8020=/etc/hadoop-1,arpit-new-falcon-1.cs1cloud.internal:8032=/etc/hadoop-1,arpit-new-falcon-2.cs1cloud.internal:8020=/etc/hadoop-2,arpit-new-falcon-2.cs1cloud.internal:8032=/etc/hadoop-2,arpit-new-falcon-5.cs1cloud.internal:8020=/etc/hadoop-3,arpit-new-falcon-5.cs1cloud.internal:8032=/etc/hadoop-3&lt;/value&gt;
+      &lt;description&gt;
+          Comma separated AUTHORITY=HADOOP_CONF_DIR, where AUTHORITY is the HOST:PORT of
+          the Hadoop service (JobTracker, HDFS). The wildcard '*' configuration is
+          used when there is no exact match for an authority. The HADOOP_CONF_DIR contains
+          the relevant Hadoop *-site.xml files. If the path is relative is looked within
+          the Oozie configuration directory; though the path can be absolute (i.e. to point
+          to Hadoop client conf/ directories in the local filesystem.
+      &lt;/description&gt;
+    &lt;/property&gt;
+
 </pre></div></div>
 <div class="section">
 <h4>Hive<a name="Hive"></a></h4>
 <p></p>
 <ul>
+<li>Dated Partitions</li></ul>Falcon does not work well when the table partition contains multiple dated columns. Falcon only works with a single dated partition. This is being tracked in FALCON-357, which is a limitation in Oozie.
+<div class="source">
+<pre>
+catalog:default:table4#year=${YEAR};month=${MONTH};day=${DAY};hour=${HOUR};minute=${MINUTE}
+
+</pre></div>
+<p></p>
+<ul>
 <li><a class="externalLink" href="https://issues.apache.org/jira/browse/HIVE-5550">Hive table import fails for tables created with default text and sequence file formats using HCatalog API</a></li></ul>For some arcane reason, hive substitutes the output format for text and sequence to be prefixed with Hive. Hive table import fails since it compares against the input and output formats of the source table and they are different. Say, a table was created with out specifying the file format, it defaults to:
 <div class="source">
 <pre>
@@ -377,7 +410,8 @@ org.apache.hadoop.hive.ql.parse.ImportSe
                 .getMsg(&quot; Table inputformat/outputformats do not match&quot;));
       }
 
-</pre></div></div>
+</pre></div>
+<p>The above is not an issue with Hive 0.13.</p></div>
 <div class="section">
 <h3>Hive Examples<a name="Hive_Examples"></a></h3>
 <p>Following is an example entity configuration for lifecycle management functions for tables in Hive.</p></div>



Mime
View raw message