hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/ViewDev" by JohnSichi
Date Tue, 22 Dec 2009 00:33:56 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/ViewDev" page has been changed by JohnSichi.
http://wiki.apache.org/hadoop/Hive/ViewDev

--------------------------------------------------

New page:
= Use Cases =

Views are a standard DBMS feature and their uses are well understood.  A typical use case
might be to create an interface layer with a consistent entity/attribute naming scheme on
top of an existing set of inconsistently named tables, without having to cause disruption
due to direct modification of the tables.  More advanced use cases would involve predefined
filters, joins, aggregations, etc for simplifying query construction by end users, as well
as sharing common definitions within ETL pipelines.

= Scope =

At a minimum, we want to 

* add queryable view support at the SQL language level (specifics of the scoping are under
discussion in the Issues section below)
** updatable views will not be supported
* make sure views show up anywhere tables can currently be enumerated/searched
* where relevant, provide additional metadata to allow views to be distinguished from tables

Beyond this, we may want to

* expose metadata about view definitions and dependencies (at table-level or column-level)
in a way that makes them consumable by metadata-driven tools

= Implementation Sketch =

The basics of view implementation are very easy due to the fact that Hive already supports
subselects in the FROM clause.

* For <b>CREATE VIEW v AS view-def-select</b>, we extend SemanticAnalyzer to behave
similarly to <b>CREATE TABLE t AS select</b>, except that we don't actually execute
the query (we stop after plan generation).  It's necessary to perform all of plan generation
(even though we're not actually going to execute the plan) since currently some validations
such as type compatibility-checking are only performed during plan generation.  After successful
validation, the text of the view is saved in the metastore (the simplest approach snips out
the text from the parser's token stream, but this approach introduces problems described in
the issues section below).
* For <b>select ... from view-reference</b>, we detect the view reference in SemanticAnalyzer.getMetaData,
load the text of its definition from the metastore, parse it back into an AST, prepare a QBExpr
to hold it, and then plug this into the referencing query's QB, resulting in a tree equivalent
to <b>select ... from (view-def-select)</b>; plan generation can then be carried
out on the combined tree.

= Issues =

Some of these are related to functionality/scope; others are related to implementation approaches.
 Opinions are welcome on all of them.

== Stored View Definition ==

In SQL:200n, a view definition is supposed to be frozen at the time it is created, so that
if the view is defined as select * from t, where t is a table with two columns a and b, then
later requests to select * from the view should return just columns a and b, even if a new
column c is later added to the table.  This is implemented correctly by most DBMS products.

There are similar issues with other kinds of references in the view definition; for example,
if a table or function name can be qualified, then the reference should be bound at the time
the view is created.

Implementing this typically requires expanding the view definition into an explicit form rather
than storing the original view definition text directly.  Doing this could require adding
"unparse" support to the AST model (to be applied after object name resolution takes place),
something which is not currently present (and which is also useful to have available in general).

== Metastore Modeling ==

The metastore model will need to be augmented in order to allow view definitions to be saved.
 An important issue to be resolved is whether to model this via inheritance, or just shoehorn
views in as a special kind of table.

With an inheritance model, views and base tables would share a common base class (here called
ColumnSet following the convention in the Common Warehouse Metamodel for lack of a better
term):

[[Image:Data--Hive--ViewDev--HiveViewInheritance.png]]

For a view, most of the storage descriptor (everything other than the column names and types)
would be irrelevant, so this model could be further refined with such discriminations.

View names and table names share the same namespace with respect to uniqueness (i.e. you can't
have a table and a view with the same name), so the name key uniqueness would need to be specified
at the base class level.

Alternately, if we choose to avoid inheritance, then we could just add a new viewText attribute
to the existing Table class (leaving it null for base tables):

[[Image:Data--Hive--ViewDev--HiveViewFlat.png]]

(Storing the view definition as a table property may not work since property values are limited
to VARCHAR(767), and view definitions may be much longer than that, so we'll need to use a
LOB.)

Comparison of the two approaches:

{| border="1"
|-
! 
! Inheritance Model
! Flat Model
|-
! JDO Support
| Need to investigate how well inheritance works for our purposes
| Nothing special
|-
! Metadata queries from existing code/tools
| Existing queries for tables will NOT include views in results; those that need to will have
to be modified to reference base class instead
| Existing queries for tables WILL include views in results; those that are not supposed to
will need to filter them out
|-
! Metastore upgrade on deployment
| Need to test carefully to make sure introducing inheritance doesn't corrupt existing metastore
instances
| Nothing special, just adding a new attribute
|}

== Dependency Tracking ==

It's necessary to track dependencies from a view to objects it references in the metastore:

* tables:  this is mandatory if we want DROP TABLE to be able to correctly CASCADE/RESTRICT
to a referencing view
* other views:  same as tables
* columns:  this is optional (useful for lineage inspection, but not required for implementing
SQL features)
* temporary functions:  we should disallow these at view creation unless we also want a concept
of temporary view (or if it's OK for the referencing view to become invalid whenever the volatile
function registry gets cleared)
* any other objects? (e.g. udt's coming in as part of [http://issues.apache.org/jira/browse/HIVE-779
HIVE-779])

(Note that MySQL doesn't actually implement CASCADE/RESTRICT:  it just ignores the keyword
and drops the table unconditionally, leaving the view dangling.)

Metastore object id's can be used for dependency modeling in order to avoid the need to update
dependency records when an object is renamed.  However, we'll need to decide what kinds of
objects can participate in dependencies.  For example, if we restrict it to just tables and
views (and assuming we don't introduce inheritance for views), then we can use a model like
the one below, in which the dependencies are tracked as (supplier,consumer) table pairs. 
(In this model, the TableDependency class is acting as an intersection table for implementing
a many-to-many relationship between suppliers and consumers).

[[Image:Data--Hive--ViewDev--HiveTableDependency.png]]

However, if later we want to introduce persistent functions, or track column dependencies,
this model will be insufficient, and we may need to introduce inheritance, with a DependencyParticipant
base class from which tables, columns, functions etc all derive.  (Again, need to verify that
JDO inheritance will actually support what we want here.)

== Dependency Invalidation ==

What happens when an object is modified underneath a view?  For example, suppose a view references
a table's column, and then ALTER TABLE is used to drop or replace that column.  Note that
if the column's datatype changes, the view definition may remain meaningful, but the view's
schema may need to be updated to match.  Here are two possible options:

* <b>Strict</b>:  prevent operations which would invalidate or change the view
in any way (and optionally to provide a CASCADE flag which requests that such views be dropped
automatically).  This is the approach taken by SQL:200n.
* <b>Lenient</b>: allow the update to proceed (and maybe warn the user of the
impact), potentially leaving the view in an invalid state.  Later, when an invalid view definition
is referenced, throw a validation exception for the referencing query.  This is the approach
taken by MySQL.  In the case of datatype changes, derived column datatypes already stored
in metastore for referencing views would become stale until those views were recreated.

Note that besides table modifications, other operations such as CREATE OR REPLACE VIEW have
similar issues (since views can reference other views).  The lenient approach provides a reasonable
solution for the related issue of external tables whose schemas may be dynamic (not sure if
we currently support this).

== View Modification ==

In SQL:200n, there's no standard way to update a view definition.  MySQL supports both

* <b>CREATE OR REPLACE VIEW v AS new-view-def-select</b>
* <b>ALTER VIEW v AS new-view-def-select</b>

Note that supporting view modification requires detection of cyclic view definitions, which
should be invalid.  Whether this detection is carried out at the time of view modification
versus reference is dependent on the strict versus lenient approaches to dependency invalidation
described above.

== Fast Path Execution ==

For <b>select * from t</b>, hive supports fast-path execution (skipping Map/Reduce).
 Is it important for this to work for <b>select * from v</b> as well?

== ORDER BY and LIMIT in view definition ==

SQL:200n prohibits ORDER BY in a view definition, since a view is supposed to be a virtual
(unordered) table, not a query alias.  However, many DBMS's ignore this rule; for example,
MySQL allows ORDER BY, but ignores it in the case where it is superceded by an ORDER BY in
the query.  Should we prevent ORDER BY?  This question also applies to the LIMIT clause.

Mime
View raw message