atlas-commits mailing list archives

Subject incubator-atlas git commit: ATLAS-181 Integrate storm topology metadata into Atlas (svenkat, yhemanth via shwethags)
Date Wed, 20 Jan 2016 07:42:53 GMT
Repository: incubator-atlas
Updated Branches:
  refs/heads/master 0003160b0 -> f3ac2c0f1

ATLAS-181 Integrate storm topology metadata into Atlas (svenkat,yhemanth via shwethags)


Branch: refs/heads/master
Commit: f3ac2c0f1e2b9e7db2aa867b2a05a68dfee75f1b
Parents: 0003160
Author: Shwetha GS <>
Authored: Wed Jan 20 13:12:49 2016 +0530
Committer: Shwetha GS <>
Committed: Wed Jan 20 13:12:49 2016 +0530

 docs/src/site/twiki/Architecture.twiki   |   1 +
 docs/src/site/twiki/StormAtlasHook.twiki | 114 ++++++++++++++++++++++++++
 docs/src/site/twiki/index.twiki          |   2 +-
 release-log.txt                          |   1 +
 4 files changed, 117 insertions(+), 1 deletion(-)
diff --git a/docs/src/site/twiki/Architecture.twiki b/docs/src/site/twiki/Architecture.twiki
index 7bfe4b6..d63ac62 100755
--- a/docs/src/site/twiki/Architecture.twiki
+++ b/docs/src/site/twiki/Architecture.twiki
@@ -25,6 +25,7 @@ Available bridges are:
    * [[Bridge-Hive][Hive Bridge]]
    * [[Bridge-Sqoop][Sqoop Bridge]]
    * [[Bridge-Falcon][Falcon Bridge]]
+   * [[StormAtlasHook][Storm Bridge]]
 ---++ Notification
Notification is used for reliable entity registration from hooks and for entity/type change
notifications. Atlas, by default, provides Kafka integration, but it's possible to provide
other implementations as well. The Atlas service starts an embedded Kafka server by default.
diff --git a/docs/src/site/twiki/StormAtlasHook.twiki b/docs/src/site/twiki/StormAtlasHook.twiki
new file mode 100644
index 0000000..3e560db
--- /dev/null
+++ b/docs/src/site/twiki/StormAtlasHook.twiki
@@ -0,0 +1,114 @@
+---+ Storm Atlas Bridge
+---++ Introduction
+Apache Storm is a distributed real-time computation system. Storm makes it
+easy to reliably process unbounded streams of data, doing for real-time
+processing what Hadoop did for batch processing. The processing flow is
+essentially a DAG of nodes, called a *topology*.
+Apache Atlas is a metadata repository that enables end-to-end data lineage,
+search and associated business classification.
+The goal of this integration is to push the operational topology
+metadata along with the underlying data source(s), target(s), derivation
+processes and any available business context so Atlas can capture the
+lineage for this topology.
+There are two parts to this process, detailed below:
+   * Data model to represent the concepts in Storm
+   * Storm Atlas Hook to update metadata in Atlas
+---++ Storm Data Model
+A data model is represented as Types in Atlas. It contains the descriptions
+of various nodes in the topology graph, such as spouts and bolts, and the
+corresponding producer and consumer types.
+The following types are added in Atlas:
+   * storm_topology - represents the coarse-grained topology. A storm_topology derives from an Atlas Process type and hence can be used to inform Atlas about lineage.
+   * The following data sets are added - kafka_topic, jms_topic, hbase_table, hdfs_data_set. These all derive from an Atlas Dataset type and hence form the end points of a lineage graph.
+   * storm_spout - Data Producer having outputs, typically Kafka, JMS
+   * storm_bolt - Data Consumer having inputs and outputs, typically Hive, HBase, HDFS, etc.
+The Storm Atlas hook auto-registers dependent models like the Hive data model
+if it finds that these are not known to the Atlas server.
+The data model for each of the types is described in
+the class definition at org.apache.atlas.storm.model.StormDataModel.
+---++ Storm Atlas Hook
+Atlas is notified when a new topology is registered successfully in
+Storm. Storm provides a hook, backtype.storm.ISubmitterHook, on the Storm client side
+that is used to submit a Storm topology.
+The Storm Atlas hook intercepts this hook post-execution, extracts the metadata from the
+topology and updates Atlas using the types defined. Atlas implements the
+Storm client hook interface in org.apache.atlas.storm.hook.StormAtlasHook.
+---++ Limitations
+The following apply to the first version of the integration:
+   * Only new topology submissions are registered with Atlas; any lifecycle changes are not reflected in Atlas.
+   * The Atlas server needs to be online when a Storm topology is submitted for the metadata to be captured.
+   * The hook currently does not support capturing lineage for custom spouts and bolts.
+---++ Installation
+The Storm Atlas Hook needs to be manually installed in Storm on the client side. The hook
+artifacts are available at: $ATLAS_PACKAGE/hook/storm
+The Storm Atlas hook jars need to be copied to $STORM_HOME/extlib
+(replace STORM_HOME with the Storm installation path).
+Restart all daemons after you have installed the Atlas hook into Storm.
+---++ Configuration
+---+++ Storm Configuration
+The Storm Atlas Hook needs to be configured in Storm client config
+in *$STORM_HOME/conf/storm.yaml* as:
+<verbatim>
+storm.topology.submission.notifier.plugin.class: "org.apache.atlas.storm.hook.StormAtlasHook"
+</verbatim>
+Also set a 'cluster name' that would be used as a namespace for objects registered in Atlas.
+This name would be used for namespacing the Storm topology, spouts and bolts.
+The other objects like data sets should ideally be identified with the cluster name of
+the components that generate them. For example, Hive tables and databases should be
+identified using the cluster name set in Hive. The Storm Atlas hook will pick this up
+if the Hive configuration is available in the Storm topology jar that is submitted on
+the client and the cluster name is defined there. This happens similarly for HBase
+data sets. In case this configuration is not available, the cluster name set in the Storm
+configuration will be used.
+<verbatim>
+atlas.cluster.name: "cluster_name"
+</verbatim>
+
+In *$STORM_HOME/conf/storm_env.ini*, set an environment variable as follows,
+where ATLAS_HOME points to where Atlas is installed:
+
+<verbatim>
+STORM_JAR_JVM_OPTS:"-Datlas.conf=$ATLAS_HOME/conf/"
+</verbatim>
+
+You could also set this up programmatically in the Storm Config as:
+
+<verbatim>
+    Config stormConf = new Config();
+    ...
+    stormConf.put(Config.STORM_TOPOLOGY_SUBMISSION_NOTIFIER_PLUGIN,
+            org.apache.atlas.storm.hook.StormAtlasHook.class.getName());
+</verbatim>
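The configuration above names a plugin class in storm.yaml that the Storm client instantiates by reflection and invokes after a successful topology submission. The following is a minimal, self-contained sketch of that mechanism; the types here (SubmitterHook, LoggingHook, TopologySubmitter) are hypothetical stand-ins for illustration only, not Storm's or Atlas's real classes.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the submitter-hook contract (Storm's real interface is
// backtype.storm.ISubmitterHook; this simplified version is illustrative).
interface SubmitterHook {
    void notify(String topologyName, Map<String, Object> conf);
}

// Stand-in for a hook implementation such as StormAtlasHook.
class LoggingHook implements SubmitterHook {
    public void notify(String topologyName, Map<String, Object> conf) {
        System.out.println("registered topology: " + topologyName);
    }
}

public class TopologySubmitter {
    static final String PLUGIN_KEY =
            "storm.topology.submission.notifier.plugin.class";

    public static void main(String[] args) throws Exception {
        Map<String, Object> conf = new HashMap<>();
        // In a real setup this value would be
        // org.apache.atlas.storm.hook.StormAtlasHook.
        conf.put(PLUGIN_KEY, "LoggingHook");

        // Resolve the configured class name and invoke the hook after
        // submission, mirroring what the Storm client does.
        String pluginClass = (String) conf.get(PLUGIN_KEY);
        SubmitterHook hook = (SubmitterHook)
                Class.forName(pluginClass).getDeclaredConstructor().newInstance();
        hook.notify("word-count", conf);
    }
}
```

Because the plugin is looked up by name at runtime, any class implementing the hook interface and present on the client classpath (e.g. in $STORM_HOME/extlib) can be plugged in without rebuilding Storm.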
diff --git a/docs/src/site/twiki/index.twiki b/docs/src/site/twiki/index.twiki
index 6f7333c..8c57d06 100755
--- a/docs/src/site/twiki/index.twiki
+++ b/docs/src/site/twiki/index.twiki
@@ -49,9 +49,9 @@ allows integration with the whole enterprise data ecosystem.
       * [[Bridge-Hive][Hive Bridge]]
       * [[Bridge-Sqoop][Sqoop Bridge]]
       * [[Bridge-Falcon][Falcon Bridge]]
+      * [[StormAtlasHook][Storm Bridge]]
    * [[HighAvailability][Fault Tolerance And High Availability Options]]
 ---++ API Documentation
    * <a href="api/rest.html">REST API Documentation</a>
diff --git a/release-log.txt b/release-log.txt
index 5f3fd06..819300a 100644
--- a/release-log.txt
+++ b/release-log.txt
@@ -7,6 +7,7 @@ ATLAS-409 Atlas will not import avro tables with schema read from a file (dosset
 ATLAS-379 Create sqoop and falcon metadata addons (venkatnrangan,bvellanki,sowmyaramesh via shwethags)
+ATLAS-181 Integrate storm topology metadata into Atlas (svenkat,yhemanth via shwethags)
 ATLAS-311 UI: Local storage for traits - caching [not cleared on refresh] To be cleared on time lapse for 1hr (Anilg via shwethags)
 ATLAS-106 Store createTimestamp and modified timestamp separately for an entity (dkantor via shwethags)
 ATLAS-433 Fix checkstyle issues for common and notification module (shwethags)
