atlas-commits mailing list archives

Subject [1/5] incubator-atlas git commit: ATLAS-1022 Update typesystem wiki with details (yhemanth via shwethags)
Date Wed, 20 Jul 2016 13:03:04 GMT
Repository: incubator-atlas
Updated Branches:
  refs/heads/master 0cf250099 -> 8e8b51b8e

ATLAS-1022 Update typesystem wiki with details (yhemanth via shwethags)


Branch: refs/heads/master
Commit: f672aaeffdad9539b52d0122cc2593e7352d6329
Parents: 0cf2500
Author: Shwetha GS <>
Authored: Wed Jul 20 12:10:10 2016 +0530
Committer: Shwetha GS <>
Committed: Wed Jul 20 12:10:10 2016 +0530

 .../site/resources/images/twiki/data-types.png  | Bin 413738 -> 0 bytes
 .../resources/images/twiki/types-instance.png   | Bin 445893 -> 0 bytes
 docs/src/site/twiki/TypeSystem.twiki            | 266 ++++++++++++++-----
 release-log.txt                                 |   1 +
 4 files changed, 197 insertions(+), 70 deletions(-)
diff --git a/docs/src/site/resources/images/twiki/data-types.png b/docs/src/site/resources/images/twiki/data-types.png
deleted file mode 100755
index 3aa1904..0000000
Binary files a/docs/src/site/resources/images/twiki/data-types.png and /dev/null differ
diff --git a/docs/src/site/resources/images/twiki/types-instance.png b/docs/src/site/resources/images/twiki/types-instance.png
deleted file mode 100755
index 6afca21..0000000
Binary files a/docs/src/site/resources/images/twiki/types-instance.png and /dev/null differ
diff --git a/docs/src/site/twiki/TypeSystem.twiki b/docs/src/site/twiki/TypeSystem.twiki
index 64cf5a2..b658cfa 100755
--- a/docs/src/site/twiki/TypeSystem.twiki
+++ b/docs/src/site/twiki/TypeSystem.twiki
@@ -1,74 +1,200 @@
 ---+ Type System
----++ Introduction
 ---++ Overview
----+++ Data Types Overview
-<img src="images/twiki/data-types.png" height="400" width="600" />
----+++ Types Instances Overview
-<img src="images/twiki/types-instance.png" height="400" width="600" />
----++ Details
-### Structs are like C structs - they don't have an identity
-- no independent lifecycle
-- like a bag of properties
-- like in hive, also
-### Classes are classes
-- like any OO class
-- have identity
-- can have inheritence
-- can contain structs
-- don't necessarily need to use a struct inside the class to define props
-- can also define props using !AttributeDefinition using the basic data types
-- classes are immutable once created
-### On search interface:
-- can search for all instances of a class
-- classes could become tables in a relational system, for instance
-	- also databases, columns, etc.
-### Traits is similar to scala - traits more like decorators (?)
-- traits get applied to instances - not classes
-	- this satisfies the classification mechanism (ish)
-- can have a class instance have any number of traits
-- e.g. security clearance - any Person class could have it; so we add it as a mixin to the
Person class
-	- security clearance trait has a level attribute
-	- traits are labels
-	- each label can have its own attribute
-- reason for doing this is:
-	- modeled security clearance trait
-	- want to prescribe it to other things, too
-	- can now search for anything that has security clearance level = 1, for instance
-### On Instances:
-- class, trait, struct all have bags of attributes
-- can get name of type associated with attribute
-- can get or set the attribute in that bag for each instance
-### On Classification:
-- create column as a class
-- create a trait to classify as "PHI"
-- would create the instance of the column with the PHI trait
-- apply traits to instances
-- CAN'T apply traits to class
-### Other useful information
-!HierarchicalClassType - base type for !ClassType and !TraitType
-Instances created from Definitions
-Every instance is referenceable - i.e. something can point to it in the graph db
-!MetadataService may not be used longterm - it is currently used for bootstrapping the repo
& type system
-Id class - represents the Id of an instance
-When the web service receives an object graph, the !ObjectGraphWalker is used to update things
-	- !DiscoverInstances is used to discover the instances in the object graph received by the
web service
-!MapIds assigns new ids to the discovered instances in the object graph
-Anything under the storage package is not part of the public interface
\ No newline at end of file
+Atlas allows users to define a model for the metadata objects they want to manage. The model
is composed of definitions
+called ‘types’. Instances of ‘types’ called ‘entities’ represent the actual metadata
objects that are managed. The Type
+System is a component that allows users to define and manage the types and entities. All
metadata objects managed by
+Atlas out of the box (like Hive tables, for example) are modelled using types and represented
as entities. To store new
+types of metadata in Atlas, one needs to understand the concepts of the type system component.
+---++ Types
+A ‘Type’ in Atlas is a definition of how a particular type of metadata object is stored
and accessed. A type
+represents one or more attributes that define the properties of the metadata
object. Users with a
+development background will recognize the similarity of a type to a ‘Class’ definition
of object oriented programming
+languages, or a ‘table schema’ of relational databases.
+An example of a type that comes natively defined with Atlas is a Hive table. A Hive table
is defined with these attributes:
+Name: hive_table
+MetaType: Class
+SuperTypes: DataSet
+    name: String (name of the table)
+    db: Database object of type hive_db
+    owner: String
+    createTime: Date
+    lastAccessTime: Date
+    comment: String
+    retention: int
+    sd: Storage Description object of type hive_storagedesc
+    partitionKeys: Array of objects of type hive_column
+    aliases: Array of strings
+    columns: Array of objects of type hive_column
+    parameters: Map of String keys to String values
+    viewOriginalText: String
+    viewExpandedText: String
+    tableType: String
+    temporary: Boolean
+The following points can be noted from the above example:
+   * A type in Atlas is identified uniquely by a ‘name’
+   * A type has a metatype. A metatype represents the type of this model in Atlas. Atlas
has the following metatypes:
+      * Basic metatypes: E.g. Int, String, Boolean etc.
+      * Enum metatypes
+      * Collection metatypes: E.g. Array, Map
+      * Composite metatypes: E.g. Class, Struct, Trait
+   * A type can ‘extend’ from a parent type called ‘supertype’ - by virtue of this,
it will get to include the attributes that are defined in the supertype as well. This allows
modellers to define common attributes across a set of related types etc. This is again similar
to the concept of how Object Oriented languages define super classes for a class. It is also
possible for a type in Atlas to extend from multiple super types.
+      * In this example, every hive table extends from a pre-defined supertype called a ‘DataSet’.
More details about these pre-defined types will be provided later.
+   * Types which have a metatype of ‘Class’, ‘Struct’ or ‘Trait’ can have a collection
of attributes. Each attribute has a name (e.g. ‘name’) and some other associated properties.
An attribute can be referred to using an expression type_name.attribute_name. It is also good
to note that attributes themselves are defined using Atlas metatypes.
+      * In this example, hive_table.name is a String, hive_table.aliases is an array of Strings,
hive_table.db refers to an instance of a type called hive_db and so on.
+   * Type references in attributes (like hive_table.db) are particularly interesting. Note
that using such an attribute, we can define arbitrary relationships between two types defined
in Atlas and thus build rich models. Note that one can also collect a list of references as
an attribute type (e.g. hive_table.columns which represents a list of references from hive_table
to the hive_column type).
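
The points above (unique type name, metatype, attributes inherited from supertypes) can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the Atlas API: the `TypeDef` class, the `all_attributes` helper and the simplified attribute lists are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TypeDef:
    """Hypothetical stand-in for an Atlas type definition (not the Atlas API)."""
    name: str                                        # unique type name
    metatype: str                                    # e.g. "Class", "Struct", "Trait"
    supertypes: list = field(default_factory=list)   # names of parent types
    attributes: dict = field(default_factory=dict)   # attribute name -> data type name


def all_attributes(type_name, registry):
    """Collect a type's attributes, including those inherited from its supertypes."""
    type_def = registry[type_name]
    merged = {}
    for parent in type_def.supertypes:
        merged.update(all_attributes(parent, registry))
    merged.update(type_def.attributes)               # own attributes take precedence
    return merged


# Simplified registry: attribute lists are shortened for illustration.
registry = {
    "Asset": TypeDef("Asset", "Class",
                     attributes={"name": "string", "description": "string", "owner": "string"}),
    "DataSet": TypeDef("DataSet", "Class", supertypes=["Asset"]),
    "hive_table": TypeDef("hive_table", "Class", supertypes=["DataSet"],
                          attributes={"db": "hive_db",
                                      "aliases": "array<string>",
                                      "columns": "array<hive_column>"}),
}

attrs = all_attributes("hive_table", registry)
print(sorted(attrs))
```

Here hive_table.db resolves to the type name hive_db, while name, description and owner arrive through the supertype chain.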
+---++ Entities
+An ‘entity’ in Atlas is a specific value or instance of a Class ‘type’ and thus represents
a specific metadata object
+in the real world. Referring back to our analogy of Object Oriented Programming languages,
an ‘instance’ is an
+‘Object’ of a certain ‘Class’.
+An example of an entity will be a specific Hive Table. Say Hive has a table called ‘customers’
in the ‘default’
+database. This table will be an ‘entity’ in Atlas of type hive_table. By virtue of being
an instance of a class
+type, it will have values for every attribute that is a part of the Hive table ‘type’,
such as:
+id: "9ba387dd-fa76-429c-b791-ffc338d3c91f"
+typeName: "hive_table"
+    name: "customers"
+    db: "b42c6cfc-c1e7-42fd-a9e6-890e0adf33bc"
+    owner: "admin"
+    createTime: "2016-06-20T06:13:28.000Z"
+    lastAccessTime: "2016-06-20T06:13:28.000Z"
+    comment: null
+    retention: 0
+    sd: "ff58025f-6854-4195-9f75-3a3058dd8dcf"
+    partitionKeys: null
+    aliases: null
+    columns: ["65e2204f-6a23-4130-934a-9679af6a211f", "d726de70-faca-46fb-9c99-cf04f6b579a6",
+    parameters: {"transient_lastDdlTime": "1466403208"}
+    viewOriginalText: null
+    viewExpandedText: null
+    tableType: "MANAGED_TABLE"
+    temporary: false
+The following points can be noted from the example above:
+   * Every entity that is an instance of a Class type is identified by a unique identifier,
a GUID. This GUID is generated by the Atlas server when the object is defined, and remains
constant for the entire lifetime of the entity. At any point in time, this particular entity
can be accessed using its GUID.
+      * In this example, the ‘customers’ table in the default database is uniquely identified
by the GUID "9ba387dd-fa76-429c-b791-ffc338d3c91f"
+   * An entity is of a given type, and the name of the type is provided with the entity definition.
+      * In this example, the ‘customers’ table is a ‘hive_table’.
+   * The values of this entity are a map of all the attribute names and their values for
attributes that are defined in the hive_table type definition.
+   * Attribute values will be according to the metatype of the attribute.
+      * Basic metatypes: integer, String, boolean values. E.g. ‘name’ = ‘customers’,
‘temporary’ = ‘false’
+      * Collection metatypes: An array or map of values of the contained metatype. E.g. parameters
= { “transient_lastDdlTime”: “1466403208”}
+      * Composite metatypes: For classes, the value will be an entity with which this particular
entity will have a relationship. E.g. The hive table “customers” is present in a database
called “default”. The relationship between the table and database is captured via the
“db” attribute. Hence, the value of the “db” attribute will be a GUID that uniquely
identifies the hive_db entity called “default”.
+With this understanding of entities, we can now see the difference between Class and Struct metatypes.
Classes and Structs
+both compose attributes of other types. However, entities of Class types have the Id attribute
(with a GUID value)
+and can be referenced from other entities (like a hive_db entity is referenced from a hive_table
entity). Instances of
+Struct types do not have an identity of their own. The value of a Struct type is a collection
of attributes that are
+‘embedded’ inside the entity itself.
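
The difference between a referenced Class entity and an embedded Struct value can be sketched as follows. This is a hypothetical in-memory model rather than Atlas code: the `entities` store, `create_entity`, `resolve`, and the `stats` struct attribute are all invented for illustration.

```python
import uuid

# Hypothetical stand-in for the Atlas repository: GUID -> entity.
entities = {}


def create_entity(type_name, values):
    """Store an entity of a Class type and return its generated GUID."""
    guid = str(uuid.uuid4())
    entities[guid] = {"id": guid, "typeName": type_name, "values": dict(values)}
    return guid


def resolve(guid, attribute):
    """Follow a Class-typed attribute (a GUID reference) to the entity it points at."""
    return entities[entities[guid]["values"][attribute]]


db_guid = create_entity("hive_db", {"name": "default"})
table_guid = create_entity("hive_table", {
    "name": "customers",
    "db": db_guid,                # Class-typed attribute: held as a GUID reference
    "stats": {"rowCount": 0},     # hypothetical Struct-typed attribute: embedded, no GUID
})

print(resolve(table_guid, "db")["values"]["name"])
```

The hive_db entity can be looked up independently by its GUID, whereas the struct value exists only inside the table entity that embeds it.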
+---++ Attributes
+We already saw that attributes are defined inside composite metatypes like Class and Struct.
But we simplistically
+referred to attributes as having a name and a metatype value. However, attributes in Atlas
have some more properties
+that define more concepts related to the type system.
+An attribute has the following properties:
+    name: string,
+    dataTypeName: string,
+    isComposite: boolean,
+    isIndexable: boolean,
+    isUnique: boolean,
+    multiplicity: enum,
+    reverseAttributeName: string
+The properties above have the following meanings:
+   * name - the name of the attribute
+   * dataTypeName - the metatype name of the attribute (basic, collection or composite)
+   * isComposite -
+      * This flag indicates an aspect of modelling. If an attribute is defined as composite,
it means that it cannot have a lifecycle independent of the entity it is contained in. A good
example of this concept is the set of columns that make a part of a hive table. Since the
columns do not have meaning outside of the hive table, they are defined as composite attributes.
+      * A composite attribute must be created in Atlas along with the entity it is contained
in. i.e. A hive column must be created along with the hive table.
+   * isIndexable -
+      * This flag indicates whether this attribute should be indexed, so that look ups
using the attribute value as a predicate can be performed efficiently.
+   * isUnique -
+      * This flag is again related to indexing. If specified to be unique, it means that
a special index is created for this attribute in Titan that allows for equality based look ups.
+      * Any attribute with a true value for this flag is treated like a primary key to distinguish
this entity from other entities. Hence care should be taken to ensure that this attribute does
model a unique property in the real world.
+         * For example, consider the name attribute of a hive_table. In isolation, a name is
not a unique attribute for a hive_table, because tables with the same name can exist in multiple
databases. Even a pair of (database name, table name) is not unique if Atlas is storing metadata
of hive tables across multiple clusters. Only a cluster location, database name and table
name can be deemed unique in the physical world.
+   * multiplicity - indicates whether this attribute is required, optional, or could be multi-valued.
If an entity’s definition of the attribute value does not match the multiplicity declaration
in the type definition, this would be a constraint violation and the entity addition will
fail. This field can therefore be used to define some constraints on the metadata information.
+Using the above, let us expand on the attribute definition of one of the attributes of the
hive table below.
+Let us look at the attribute called ‘db’ which represents the database to which the hive
table belongs:
+    "dataTypeName": "hive_db",
+    "isComposite": false,
+    "isIndexable": true,
+    "isUnique": false,
+    "multiplicity": "required",
+    "name": "db",
+    "reverseAttributeName": null
+Note the “required” constraint on multiplicity. A table entity cannot be sent without
a db reference.
+    "dataTypeName": "array<hive_column>",
+    "isComposite": true,
+    "isIndexable": true,
+    "isUnique": false,
+    "multiplicity": "optional",
+    "name": "columns",
+    "reverseAttributeName": null
+Note the “isComposite” true value for columns. By doing this, we are indicating that
the defined column entities should
+always be bound to the table entity they are defined with.
+From this description and examples, you will be able to see that attribute definitions
can be used to influence
+specific modelling behavior (constraints, indexing, etc) to be enforced by the Atlas system.
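
As a rough illustration of how a multiplicity declaration acts as a constraint, the sketch below checks an entity's values against the two attribute definitions shown above. It is an assumption-laden simplification: `check_multiplicity` is a hypothetical helper, it only handles the "required" case, and it does not reflect how Atlas performs validation internally.

```python
# The two attribute definitions from the examples above, as plain dictionaries.
attribute_defs = [
    {"name": "db", "dataTypeName": "hive_db", "isComposite": False,
     "isIndexable": True, "isUnique": False, "multiplicity": "required",
     "reverseAttributeName": None},
    {"name": "columns", "dataTypeName": "array<hive_column>", "isComposite": True,
     "isIndexable": True, "isUnique": False, "multiplicity": "optional",
     "reverseAttributeName": None},
]


def check_multiplicity(attribute_defs, values):
    """Return a list of violations; an empty list means the values are acceptable.

    Hypothetical helper: only the "required" multiplicity is checked here.
    """
    errors = []
    for attr in attribute_defs:
        if attr["multiplicity"] == "required" and values.get(attr["name"]) is None:
            errors.append("missing required attribute: " + attr["name"])
    return errors


# A table without a db reference violates the constraint; columns may be absent.
no_db = check_multiplicity(attribute_defs, {"columns": None})
with_db = check_multiplicity(attribute_defs, {"db": "b42c6cfc-c1e7-42fd-a9e6-890e0adf33bc"})
print(no_db, with_db)
```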
+---++ System specific types and their significance
+Atlas comes with a few pre-defined system types. We saw one example (DataSet) in the preceding
sections. In this
+section we will see all these types and understand their significance.
+*Referenceable*: This type represents all entities that can be searched for using a unique
attribute called
+*Asset*: This type contains attributes like name, description and owner. Name is a required attribute
+(multiplicity = required), the others are optional. The purpose of Referenceable and Asset
is to provide modellers
+with a way to enforce consistency when defining and querying entities of their own types. Having
this fixed set of
+attributes allows applications and User interfaces to make convention based assumptions about
what attributes they can
+expect of types by default.
+*Infrastructure*: This type extends Referenceable and Asset and can typically be used
as a common super type for
+infrastructural metadata objects like clusters, hosts etc.
+*!DataSet*: This type extends Referenceable and Asset. Conceptually, it can be used to represent
a type that stores
+data. In Atlas, hive tables, Sqoop RDBMS tables etc are all types that extend from !DataSet.
Types that extend !DataSet
+can be expected to have a Schema in the sense that they would have an attribute that defines
attributes of that dataset.
+An example is the columns attribute in a hive_table. Also, entities of types that extend !DataSet
participate in data
+transformation and this transformation can be captured by Atlas via lineage (or provenance).
+*Process*: This type extends Referenceable and Asset. Conceptually, it can be used to represent
any data transformation
+operation. For example, an ETL process that transforms a hive table with raw data to another
hive table that stores
+some aggregate can be a specific type that extends the Process type. A Process type has two
specific attributes,
+inputs and outputs. Both inputs and outputs are arrays of !DataSet entities. Thus an instance
of a Process type can
+use these inputs and outputs to capture how the lineage of a !DataSet evolves.
\ No newline at end of file
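
The way Process inputs and outputs capture lineage, as described in the Process paragraph above, can be sketched with a small traversal. Everything here is hypothetical: the process and dataset names are invented, datasets are identified by plain names instead of GUIDs, and `upstream` is an illustrative helper rather than an Atlas lineage API.

```python
# Hypothetical Process entities; inputs and outputs name DataSet entities.
processes = [
    {"name": "aggregate_sales", "inputs": ["raw_sales"], "outputs": ["sales_summary"]},
    {"name": "build_report", "inputs": ["sales_summary"], "outputs": ["sales_report"]},
]


def upstream(dataset, processes):
    """Return every dataset that the given dataset is directly or indirectly derived from."""
    found = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for proc in processes:
            if current in proc["outputs"]:
                for source in proc["inputs"]:
                    if source not in found:      # also guards against cycles
                        found.add(source)
                        stack.append(source)
    return found


print(sorted(upstream("sales_report", processes)))
```

Walking outputs back to inputs yields the provenance of a dataset; swapping the roles of inputs and outputs would give its downstream impact instead.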
diff --git a/release-log.txt b/release-log.txt
index 824778e..40de319 100644
--- a/release-log.txt
+++ b/release-log.txt
+ATLAS-1022 Update typesystem wiki with details (yhemanth via shwethags)
 ATLAS-1021 Update Atlas architecture wiki (yhemanth via sumasai)
 ATLAS-957 Atlas is not capturing topologies that have $ in the data payload (shwethags)
 ATLAS-1032 Atlas hook package should not include libraries already present in host component - like log4j (mneethiraj via sumasai)
