falcon-commits mailing list archives

From pall...@apache.org
Subject falcon git commit: FALCON-1562 Documentation for enabling native scheduler in falcon
Date Tue, 15 Dec 2015 12:19:05 GMT
Repository: falcon
Updated Branches:
  refs/heads/master 7f4ff1a37 -> 8be383ebe


FALCON-1562 Documentation for enabling native scheduler in falcon


Project: http://git-wip-us.apache.org/repos/asf/falcon/repo
Commit: http://git-wip-us.apache.org/repos/asf/falcon/commit/8be383eb
Tree: http://git-wip-us.apache.org/repos/asf/falcon/tree/8be383eb
Diff: http://git-wip-us.apache.org/repos/asf/falcon/diff/8be383eb

Branch: refs/heads/master
Commit: 8be383ebe43b37deb2f24fcbdb2e6e230c8deed9
Parents: 7f4ff1a
Author: Pallavi Rao <pallavi.rao@inmobi.com>
Authored: Tue Dec 15 17:48:32 2015 +0530
Committer: Pallavi Rao <pallavi.rao@inmobi.com>
Committed: Tue Dec 15 17:48:32 2015 +0530

----------------------------------------------------------------------
 CHANGES.txt                                     |   2 +
 docs/src/site/twiki/Configuration.twiki         |   4 +
 docs/src/site/twiki/FalconDocumentation.twiki   |   2 +
 docs/src/site/twiki/FalconNativeScheduler.twiki | 163 +++++++++++++++++++
 docs/src/site/twiki/falconcli/Schedule.twiki    |   4 +-
 src/conf/startup.properties                     |   8 +-
 6 files changed, 176 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/CHANGES.txt
----------------------------------------------------------------------
diff --git a/CHANGES.txt b/CHANGES.txt
index 86a5c9a..0f773de 100755
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -9,6 +9,8 @@ Trunk (Unreleased)
   INCOMPATIBLE CHANGES
 
   NEW FEATURES
+    FALCON-1562 Documentation for enabling native scheduler in falcon (Pallavi Rao)
+
     FALCON-1512 Implement touch feature for native scheduler (Pallavi Rao)
 
     FALCON-1233 Support co-existence of Oozie scheduler (coord) and Falcon native scheduler
(Pallavi Rao)

http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/docs/src/site/twiki/Configuration.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/Configuration.twiki b/docs/src/site/twiki/Configuration.twiki
index 74da49a..0df094f 100644
--- a/docs/src/site/twiki/Configuration.twiki
+++ b/docs/src/site/twiki/Configuration.twiki
@@ -100,6 +100,10 @@ Some Falcon features such as late data handling, retries, metadata service,
depe
    * Copy notification related properties in oozie/conf/oozie-site.xml to oozie-site.xml
of the Oozie installation.  Restart Oozie so changes get reflected.  
 
 *NOTE : If you disable Falcon post-processing JMS notification and not enable Oozie JMS notification,
features such as failure retry, late data handling and metadata service will be disabled for
all entities on the server.*
+
+---+++Enabling Falcon Native Scheduler
+You can choose to schedule entities either using Oozie's coordinator or using Falcon's native scheduler. To schedule entities natively on Falcon, you will need to add some additional properties to <verbatim>$FALCON_HOME/conf/startup.properties</verbatim> before starting the Falcon Server. For details, refer to [[FalconNativeScheduler][Falcon Native Scheduler]].
+
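
As a rough sketch only (the complete property values are listed in the FalconNativeScheduler page added below), enabling the native scheduler amounts to pointing *.application.services at the scheduler-aware service list and setting the DAG engine:
<verbatim>
## Sketch -- see FalconNativeScheduler.twiki below for the full *.application.services list.
*.dag.engine.impl=org.apache.falcon.workflow.engine.OozieDAGEngine
*.application.services=<scheduler-aware service list, ending with org.apache.falcon.execution.FalconExecutionService>
</verbatim>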
 ---+++Adding Extension Libraries
 
 Library extensions allows users to add custom libraries to entity lifecycles such as feed
retention, feed replication

http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/docs/src/site/twiki/FalconDocumentation.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/FalconDocumentation.twiki b/docs/src/site/twiki/FalconDocumentation.twiki
index f384a42..95f388a 100644
--- a/docs/src/site/twiki/FalconDocumentation.twiki
+++ b/docs/src/site/twiki/FalconDocumentation.twiki
@@ -37,6 +37,8 @@ Falcon system has picked Oozie as the default scheduler. However the system
is o
 other schedulers. Lot of the data processing in hadoop requires scheduling to be based on
both data availability
 as well as time. Oozie currently supports these capabilities off the shelf and hence the
choice.
 
+While the use of Oozie works reasonably well, there are scenarios where Oozie scheduling is proving to be a limiting factor. In its current form, Falcon relies on Oozie both for scheduling and for workflow execution; as a result, scheduling is limited to time-based/cron-based scheduling with additional gating conditions on data availability. This also restricts datasets to being periodic/cyclic in nature. In order to offer better scheduling capabilities, Falcon comes with its own native scheduler. Refer to [[FalconNativeScheduler][Falcon Native Scheduler]] for details.
+
 ---+++ Control flow
 Though the actual responsibility of the workflow is with the scheduler (Oozie), Falcon remains
in the
 execution path, by subscribing to messages that each of the workflow may generate. When Falcon
generates a

http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/docs/src/site/twiki/FalconNativeScheduler.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/FalconNativeScheduler.twiki b/docs/src/site/twiki/FalconNativeScheduler.twiki
new file mode 100644
index 0000000..9403ae7
--- /dev/null
+++ b/docs/src/site/twiki/FalconNativeScheduler.twiki
@@ -0,0 +1,163 @@
+---+ Falcon Native Scheduler
+
+---++ Overview
+Falcon has been using Oozie as its scheduling engine.  While the use of Oozie works reasonably well, there are scenarios where Oozie scheduling is proving to be a limiting factor. In its current form, Falcon relies on Oozie both for scheduling and for workflow execution; as a result, scheduling is limited to time-based/cron-based scheduling with additional gating conditions on data availability. This also restricts datasets to being periodic in nature. In order to offer better scheduling capabilities, Falcon comes with its own native scheduler. 
+
+---++ Capabilities
+The native scheduler will offer the capabilities offered by the Oozie coordinator and more. It will be built and released over the next few releases of Falcon, giving users an opportunity to use it and provide feedback.
+
+Currently, the native scheduler offers the following capabilities:
+   1. Submit and schedule a Falcon process that runs periodically (without data dependency) - It could be a Pig script, an Oozie workflow or a Hive job (all the engine types currently supported).
+   1. Monitor/Query/Modify the scheduled process - All applicable entity APIs and instance APIs should work as they do now.  Falcon provides data management functions for feeds declaratively. It allows users to represent feed locations as time-based partition directories on HDFS containing files.
+
+*NOTE: Execution order is FIFO. LIFO and LAST_ONLY are not supported yet.*
+
+In the near future, the Falcon scheduler will provide feature parity with the Oozie scheduler and, in subsequent releases, will provide the following features:
+   * Periodic, cron-based, calendar-based scheduling.
+   * Data availability based scheduling.
+   * External trigger/notification based scheduling.
+   * Support for periodic/a-periodic datasets.
+   * Support for optional/mandatory datasets. Option to specify minimum/maximum/exactly-N instances of data to consume.
+   * Handle dependencies across entities during re-run.
+
+---++ Configuring Native Scheduler
+You can enable the native scheduler by making changes to __$FALCON_HOME/conf/startup.properties__ as follows. You will need to restart the Falcon Server for the changes to take effect.
+<verbatim>
+*.dag.engine.impl=org.apache.falcon.workflow.engine.OozieDAGEngine
+*.application.services=org.apache.falcon.security.AuthenticationInitializationService,\
+                        org.apache.falcon.workflow.WorkflowJobEndNotificationService, \
+                        org.apache.falcon.service.ProcessSubscriberService,\
+                        org.apache.falcon.service.FeedSLAMonitoringService,\
+                        org.apache.falcon.service.LifecyclePolicyMap,\
+                        org.apache.falcon.state.store.service.FalconJPAService,\
+                        org.apache.falcon.entity.store.ConfigurationStore,\
+                        org.apache.falcon.rerun.service.RetryService,\
+                        org.apache.falcon.rerun.service.LateRunService,\
+                        org.apache.falcon.metadata.MetadataMappingService,\
+                        org.apache.falcon.service.LogCleanupService,\
+                        org.apache.falcon.service.GroupsService,\
+                        org.apache.falcon.service.ProxyUserService,\
+                        org.apache.falcon.notification.service.impl.JobCompletionService,\
+                        org.apache.falcon.notification.service.impl.SchedulerService,\
+                        org.apache.falcon.notification.service.impl.AlarmService,\
+                        org.apache.falcon.notification.service.impl.DataAvailabilityService,\
+                        org.apache.falcon.execution.FalconExecutionService
+</verbatim>
+
+---+++ Making the Native Scheduler the default scheduler
+To ensure backward compatibility, the default scheduler is still Oozie even when the native scheduler is enabled. This means entities are scheduled on the Oozie scheduler by default. Users will need to explicitly specify the scheduler as native if they wish to schedule entities using the native scheduler. 
+
+<a href="#Scheduling_new_entities_on_Native_Scheduler">This section</a> has more
details on how to schedule on either of the schedulers. 
+
+If you wish to make the Falcon Native Scheduler your default scheduler and remove Oozie as
the scheduler, set the following property in __$FALCON_HOME/conf/startup.properties__
+<verbatim>
+## If you wish to use Falcon native scheduler as your default scheduler, set the workflow engine to FalconWorkflowEngine instead of OozieWorkflowEngine. ##
+*.workflow.engine.impl=org.apache.falcon.workflow.engine.FalconWorkflowEngine
+</verbatim>
+
+---+++ Configuring the state store for Native Scheduler
+
+The Falcon Server needs to maintain the state of entities and instances in a persistent store for the system to be recoverable. Since Prism only federates, it does not need to maintain any state information. The following properties need to be set in the startup.properties of Falcon Servers:
+<verbatim>
+######### StateStore Properties #####
+*.falcon.state.store.impl=org.apache.falcon.state.store.jdbc.JDBCStateStore
+*.falcon.statestore.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
+*.falcon.statestore.jdbc.url=jdbc:derby:data/falcon.db
+*.falcon.statestore.jdbc.username=sa
+*.falcon.statestore.jdbc.password=
+*.falcon.statestore.connection.data.source=org.apache.commons.dbcp.BasicDataSource
+# Maximum number of active connections that can be allocated from this pool at the same time.
+*.falcon.statestore.pool.max.active.conn=10
+*.falcon.statestore.connection.properties=
+# Indicates the interval (in milliseconds) between eviction runs.
+*.falcon.statestore.validate.db.connection.eviction.interval=300000
+## The number of objects to examine during each run of the idle object evictor thread.
+*.falcon.statestore.validate.db.connection.eviction.num=10
+## Creates the Falcon DB.
+## If set to true, it creates the DB schema if it does not exist; if the DB schema exists, this is a NOP.
+## If set to false, it does not create the DB schema; if the DB schema does not exist, start up fails.
+*.falcon.statestore.create.db.schema=true
+</verbatim> 
+
+The _*.falcon.statestore.jdbc.url_ property in startup.properties determines the DB and the data location. All other properties are common across RDBMSs.
+
+*NOTE : Although multiple Falcon Servers can share a DB (not applicable for Derby DB), it
is recommended that you have different DBs for different Falcon Servers for better performance.*
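
For example (hypothetical host and database names), two Falcon Servers sharing one MySQL instance would each point at their own database:
<verbatim>
## Hypothetical example: give each Falcon Server its own database.
## On Falcon Server 1:
*.falcon.statestore.jdbc.url=jdbc:mysql://dbhost:3306/falcon_server1
## On Falcon Server 2:
*.falcon.statestore.jdbc.url=jdbc:mysql://dbhost:3306/falcon_server2
</verbatim>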
+
+You will need to create the state DB and tables before starting the Falcon Server. To create
tables, a tool comes bundled with the Falcon installation. You can use the _falcon-db.sh_
script to create tables in the DB. The script needs to be run only for Falcon Servers and
can be run by any user that has execute permission on the script. The script picks up the
DB connection details from __$FALCON_HOME/conf/startup.properties__. Ensure that you have
granted the right privileges to the user mentioned in _startup.properties_, so the tables
can be created.  
+
+You can use the help command to get details on the sub-commands supported:
+<verbatim>
+./bin/falcon-db.sh help
+Hadoop home is set, adding libraries from '/Users/pallavi.rao/falcon/hadoop-2.6.0/bin/hadoop classpath' into falcon classpath
+usage: 
+      Falcon DB initialization tool currently supports Derby DB/ Mysql
+
+      falcondb help : Display usage for all commands or specified command
+
+      falcondb version : Show Falcon DB version information
+
+      falcondb create <OPTIONS> : Create Falcon DB schema
+                      -run             Confirmation option regarding DB schema creation/upgrade
+                      -sqlfile <arg>   Generate SQL script instead of creating/upgrading
the DB
+                                       schema
+
+      falcondb upgrade <OPTIONS> : Upgrade Falcon DB schema
+                       -run             Confirmation option regarding DB schema creation/upgrade
+                       -sqlfile <arg>   Generate SQL script instead of creating/upgrading
the DB
+                                        schema
+
+</verbatim>
+Currently, MySQL and Derby are supported as state stores. We may extend support to other
DBs in the future. Falcon has been tested against MySQL v5.5.
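
For example, based on the usage output above, the tables can be created directly (rather than generating a SQL script) with the _create_ sub-command and the _-run_ confirmation option:
<verbatim>
## Sketch based on the usage output above: create the Falcon DB schema using the
## connection details in $FALCON_HOME/conf/startup.properties.
./bin/falcon-db.sh create -run
</verbatim>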
+
+---++++ Using Derby as the State Store
+Using Derby is ideal for QA and staging setups. Falcon comes bundled with a Derby connector, and no explicit setup is required (although you can set it up) in terms of creating the DB or tables.
+For example,
+ <verbatim> *.falcon.statestore.jdbc.url=jdbc:derby:data/falcon.db;create=true </verbatim>
+
+ tells Falcon to use the Derby JDBC connector, with the data directory $FALCON_HOME/data/ and the DB name 'falcon'. If _create=true_ is specified, you will not need to create a DB up front; a database will be created if it does not exist.
+
+---++++ Using MySQL as the State Store
+The jdbc.url property in startup.properties determines the DB and data location.
+For example,
+ <verbatim> *.falcon.statestore.jdbc.url=jdbc:mysql://localhost:3306/falcon </verbatim>
+
+ tells Falcon to use the MySQL JDBC connector, which is accessible at localhost:3306, with the DB name 'falcon'.
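
A fuller MySQL configuration might look like the sketch below; the driver class assumes the MySQL Connector/J JAR is available on the Falcon classpath, and the username/password values are placeholders:
<verbatim>
## Sketch, assuming MySQL Connector/J; replace the placeholder credentials.
*.falcon.state.store.impl=org.apache.falcon.state.store.jdbc.JDBCStateStore
*.falcon.statestore.jdbc.driver=com.mysql.jdbc.Driver
*.falcon.statestore.jdbc.url=jdbc:mysql://localhost:3306/falcon
*.falcon.statestore.jdbc.username=<db user>
*.falcon.statestore.jdbc.password=<db password>
</verbatim>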
+
+---++ Scheduling new entities on Native Scheduler
+To schedule an entity (currently only process is supported) using the native scheduler, you
need to specify the scheduler in the schedule command as shown below:
+<verbatim>
+$FALCON_HOME/bin/falcon entity -type process -name <process name> -schedule -properties falcon.scheduler:native
+</verbatim>
+
+If Oozie is configured as the default scheduler, you can skip the scheduler option or explicitly
set it to _oozie_, as shown below:
+<verbatim>
+$FALCON_HOME/bin/falcon entity -type process -name <process name> -schedule
+OR
+$FALCON_HOME/bin/falcon entity -type process -name <process name> -schedule -properties falcon.scheduler:oozie
+</verbatim>
+
+If the native scheduler is configured as the default scheduler, you can omit the scheduler option, as shown below:
+<verbatim>
+$FALCON_HOME/bin/falcon entity -type process -name <process name> -schedule 
+</verbatim>
+
+---++ Migrating entities from Oozie Scheduler to Native Scheduler
+Currently, users will have to delete and re-create entities in order to move across schedulers. Attempting to schedule an already scheduled entity on a different scheduler will result in an error. Note that the history of instances prior to scheduling on the native scheduler will not be available via the instance APIs. However, users can retrieve that information using the metadata APIs. The native scheduler must be enabled before migrating entities to it.
+
+<a href="#Configuring_Native_Scheduler">Configuring Native Scheduler</a> has
more details on how to enable native scheduler.
+
+---+++ Migrating from Oozie to Native Scheduler
+   * Delete the entity (process). 
+<verbatim>$FALCON_HOME/bin/falcon entity -type process -name <process name> -delete </verbatim>
+   * Submit the entity (process) with a start time from where the Oozie scheduler left off. 
+<verbatim>$FALCON_HOME/bin/falcon entity -type process -submit <path to process xml> </verbatim>
+   * Schedule the entity on the native scheduler. 
+<verbatim> $FALCON_HOME/bin/falcon entity -type process -name <process name> -schedule -properties falcon.scheduler:native </verbatim>
+
+---+++ Reverting to Oozie from Native Scheduler
+   * Delete the entity (process). 
+<verbatim>$FALCON_HOME/bin/falcon entity -type process -name <process name> -delete </verbatim>
+   * Submit the entity (process) with a start time from where the native scheduler left off. 
+<verbatim>$FALCON_HOME/bin/falcon entity -type process -submit <path to process xml> </verbatim>
+   * Schedule the entity on the default scheduler (Oozie).
+ <verbatim> $FALCON_HOME/bin/falcon entity -type process -name <process name> -schedule </verbatim>

http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/docs/src/site/twiki/falconcli/Schedule.twiki
----------------------------------------------------------------------
diff --git a/docs/src/site/twiki/falconcli/Schedule.twiki b/docs/src/site/twiki/falconcli/Schedule.twiki
index 63aa9c1..c4422e7 100644
--- a/docs/src/site/twiki/falconcli/Schedule.twiki
+++ b/docs/src/site/twiki/falconcli/Schedule.twiki
@@ -13,10 +13,10 @@ Optional Args :
 
 -doAs <username>
 
--properties <<key1:val1,...,keyN:valN>>. Specifying 'falcon.scheduler:native'
as a property will schedule the entity on the the native scheduler of Falcon. Else, it will
default to the engine specified in startup.properties.
+-properties <<key1:val1,...,keyN:valN>>. Specifying 'falcon.scheduler:native' as a property will schedule the entity on the native scheduler of Falcon. Otherwise, it will default to the engine specified in startup.properties. For details on the native scheduler, refer to [[FalconNativeScheduler][Falcon Native Scheduler]].
 
 Examples:
 
  $FALCON_HOME/bin/falcon entity  -type process -name sampleProcess -schedule
 
- $FALCON_HOME/bin/falcon entity  -type process -name sampleProcess -schedule -properties falcon.scheduler:native
\ No newline at end of file
+ $FALCON_HOME/bin/falcon entity  -type process -name sampleProcess -schedule -properties falcon.scheduler:native

http://git-wip-us.apache.org/repos/asf/falcon/blob/8be383eb/src/conf/startup.properties
----------------------------------------------------------------------
diff --git a/src/conf/startup.properties b/src/conf/startup.properties
index 95d792b..ef0a2d5 100644
--- a/src/conf/startup.properties
+++ b/src/conf/startup.properties
@@ -51,10 +51,7 @@
                         org.apache.falcon.service.LogCleanupService,\
                         org.apache.falcon.service.GroupsService,\
                         org.apache.falcon.service.ProxyUserService
-## If you wish to use Falcon native scheduler uncomment out below  application services and
-# comment out above application
-#
-#services ##
+## If you wish to use the Falcon native scheduler, uncomment the application services below and comment out the application services above. ##
 #*.application.services=org.apache.falcon.security.AuthenticationInitializationService,\
 #                        org.apache.falcon.workflow.WorkflowJobEndNotificationService, \
 #                        org.apache.falcon.service.ProcessSubscriberService,\
@@ -282,12 +279,13 @@ prism.configstore.listeners=org.apache.falcon.entity.v0.EntityGraph,\
 #*.falcon.state.store.impl=org.apache.falcon.state.store.jdbc.JDBCStateStore
 #*.falcon.statestore.jdbc.driver=org.apache.derby.jdbc.EmbeddedDriver
 ## Falcon currently supports derby and mysql, change url based on DB.
-#*.falcon.statestore.jdbc.url=jdbc:derby:data/statestore.db;create=true
+#*.falcon.statestore.jdbc.url=jdbc:derby:data/falcon.db;create=true
 #*.falcon.statestore.jdbc.username=sa
 #*.falcon.statestore.jdbc.password=
 #*.falcon.statestore.connection.data.source=org.apache.commons.dbcp.BasicDataSource
 ## Maximum number of active connections that can be allocated from this pool at the same
time.
 #*.falcon.statestore.pool.max.active.conn=10
+## Any additional connection properties that need to be used, specified as comma separated
key=value pairs.
 #*.falcon.statestore.connection.properties=
 ## Indicates the interval (in milliseconds) between eviction runs.
 #*.falcon.statestore.validate.db.connection.eviction.interval=300000
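
As a hypothetical illustration of the comma-separated key=value format described above for *.falcon.statestore.connection.properties (the actual property names depend on the JDBC driver in use):
<verbatim>
## Hypothetical driver-specific settings, shown only to illustrate the key=value format.
#*.falcon.statestore.connection.properties=autoReconnect=true,connectTimeout=5000
</verbatim>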

