From Apache Wiki <wikidi...@apache.org>
Subject [Pig Wiki] Update of "Howl/HowlImportExport" by AlanGates
Date Mon, 06 Dec 2010 22:23:25 GMT
The "Howl/HowlImportExport" page has been changed by AlanGates.
http://wiki.apache.org/pig/Howl/HowlImportExport

--------------------------------------------------

New page:
This document specifies the use cases, requirements, and design for the Howl import and export format.  The purpose of this feature is to allow users to move data into, out of, and between Howl instances while still using Howl tools.

== Definition of Terms ==

 HIEOutputFormat:: Howl Import Export !OutputFormat, used to create a data collection when
Howl's metastore is not present.
 HIEInputFormat:: Howl Import Export !InputFormat, used to read a data collection when Howl's
metastore is not present.
 HIELoader:: Howl Import Export Loader, used to read a data collection when Howl's metastore is not present.
 HIEStorage:: Howl Import Export Store Function, used to create a data collection when Howl's metastore is not present.
 data collection:: All of the data and metadata created by HIEOutputFormat, HIEStorage, or Howl's export command.  Note that unlike many software archive formats (e.g. tar), a collection is a set of files under one directory, not a single file.

== Use Cases ==
=== Importing Data ===

A user creates data on a grid or in a location where access to the Howl metastore is not available, but wishes to store the data in such a way that it can easily be imported into Howl at a later time.  In MapReduce, the user uses HIEOutputFormat to store the data, providing the file system location at which to store the data, the storage format to use (e.g. text, !RCFile), the schema of the data, how the data is sorted, and optionally the name of the intended Howl table and values of partition keys for the data (that is, if the destination table for this data is partitioned on datestamp, and this data will eventually be the partition for 20101109, then the key `datestamp` with the value `20101109` could be included with this data).  In Pig, the user uses HIEStorage, specifying the same information except for the schema and sort order, which Pig will already know.
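
For illustration, creating a collection from a MapReduce job might look like the sketch below.  The HIE interfaces are still TBD (see the Design section), so `HIECollectionInfo`, `setOutput`, and the other HIE names here are hypothetical placeholders rather than a committed API.

{{{
// Hypothetical sketch only; the real HIEOutputFormat interface is TBD.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CreateCollection {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "create data collection");
    job.setJarByClass(CreateCollection.class);
    // ... configure input, mapper, and reducer as usual ...

    // Required: location, storage format, schema, and sort information.
    HIECollectionInfo info = new HIECollectionInfo();       // hypothetical class
    info.setLocation("hdfs:///user/alice/clicks_20101109"); // where the collection is written
    info.setStorageFormat("RCFile");                        // text, RCFile, ...
    info.setSchema("user:string, url:string, clicks:int");
    info.setSortKey("user");
    // Optional: intended table and partition key values.
    info.setTable("web_clicks");
    info.setPartitionKeyValue("datestamp", "20101109");

    job.setOutputFormatClass(HIEOutputFormat.class);        // hypothetical class
    HIEOutputFormat.setOutput(job, info);                   // hypothetical method

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}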

At a later time the user (or another user) can then import the data into Howl using the Howl
CLI IMPORT command, providing the name of the table to import the data to and values for all
partition keys for that table if they are not specified in the data.  If the data needs to
be transported before import, Hadoop's distcp utility can be used for grid-to-grid moves, or `hadoop fs -copyFromLocal` can be used for local-to-grid copies.  If the specified table does not
yet exist, Howl will create it, using the information in the imported metadata to set defaults
for the table such as schema, storage format, etc.

=== Exporting Data ===
A user wishes to take data from Howl and use it on a grid or in a location where access to
the Howl metastore is not available.  Using the Howl CLI EXPORT command the user can export
the data, which will result in a data collection that includes metadata on the data's storage
format, the schema, how the data is sorted, what table the data came from, and values of any
partition keys from that table.  If the user needs to move the data to another grid, Hadoop's distcp utility can be used.  If the user needs to copy it from the grid to a local file system, `hadoop fs -copyToLocal` can be used.

At a later time the user (or another user) can read the data using HIEInputFormat or HIELoader
directly.  These will require the file system location where the data collection resides.
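
Again as a hedged sketch (the HIEInputFormat interface is TBD in the Design section), a read-side job would need only the collection's directory, since the schema, storage format, and sort order travel in the metadata stored with the data.  The HIE names below are hypothetical.

{{{
// Hypothetical sketch only; the real HIEInputFormat interface is TBD.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReadCollection {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "read data collection");
    job.setJarByClass(ReadCollection.class);

    // Point the job at the directory holding the data collection; all
    // other information comes from the metadata stored alongside the data.
    job.setInputFormatClass(HIEInputFormat.class);                      // hypothetical class
    HIEInputFormat.setInput(job, "hdfs:///user/alice/clicks_20101109"); // hypothetical method

    // ... configure mapper and output as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}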

=== Transfer Between Howl Instances ===
A user wishes to move data between grids with Howl, preserving the table and partition structure.
 The user can use the Howl CLI to export Howl data, distcp to move the data between grids,
and then the Howl CLI to import the data on the new grid.

=== Howless Howl ===
A user wishes to store metadata with data while not having a Howl metastore.  In this case
data can be created by HIEOutputFormat or HIEStorage and read by HIEInputFormat or HIELoader.

== Requirements ==

 1. Using !HIEOutputFormat or !HIEStorage the user will be required to provide a file system location to store the data collection, a schema, whether the data is sorted and by what key, and the storage format to use to store the data.
 1. Using !HIEOutputFormat or !HIEStorage the user will be able, but not required, to provide an intended import table and values for partition keys.
 1. Initially, data collections created by !HIEOutputFormat, !HIEStorage, and the Howl CLI export command will be limited to one partition of a table.  However, there should not be anything in the design that prevents us from extending these to allow multiple partitions once Howl supports dynamic partition creation.
 1. These tools will not know the locations of jars containing the required storage formats
to read or write data.  It will be the responsibility of the user to provide these jars in
the classpath of their MR job or Pig invocation.  For example, if the data is to be stored
in !RCFile, the user must make sure the appropriate jars are in the classpath of the job creating
the data collection.
 1. If a data collection is imported into Howl with no table name, the user doing the import
will be required to provide the name of the table to import the data into.
 1. If a data collection is imported into a Howl table that is partitioned but the data collection
provides no partition key values, the user doing the import will be required to provide values
for each of the partition keys.  In this case, even in the future, import will be limited
to a single partition.
 1. If the data collection specifies a table name but that table does not exist, upon import Howl will create that table using the storage format, schema, and sort information from the data collection.
 1. If the data collection specifies a table name and partition key values that duplicate
existing key values (for example the value for the datestamp key is `20101109` but a partition
for the table already exists with that datestamp) then the import utility will return an error
and not do the import.
 1. If the data collection specifies a table name and partition keys that are incorrect or
incomplete (for example the table requires datestamp and region keys and only datestamp is
provided, or datestamp and language are provided) then an error will be returned and the import
will not be done.
 1. If the data collection has a schema that does not match the schema of the table being imported to, an error will be returned and the import will not be done.
 1. If the data collection has a sort order that does not match the sort order of the table being imported to, an error will be returned and the import will not be done.
 1. Users must have write permission on a table to import data into it.
 1. Users must have read permission on a table to export data from it.
 1. Data to be exported will be copied to a new directory so that the user can operate on
that directory without affecting the data in Howl.
 1. All data placed into a collection by HIEOutputFormat, HIEStorage, or export will be located under one directory for easy use with distcp, copyFromLocal, and copyToLocal.
 1. All metadata stored by !HIEOutputFormat, !HIEStorage, and the CLI export command must be in a file whose name begins with an underscore (so that Hadoop jobs will ignore it) and must be located in the same directory as the data (so that it will be moved by distcp, copyToLocal, and copyFromLocal along with the data); see the layout sketch after this list.
 1. Exporting the data will require a valid HDFS location to export the data to.
 1. Exported metadata will contain full table name and partition information.
 1. HIEInputFormat and HIELoader will require the file system location of the data.
 1. HIEInputFormat and HIELoader will not know the classpath of the underlying !InputFormat
and storage handler.  It will be the job of the user to provide the jars for these in the
classpath.
 1. HIEInputFormat, HIELoader, HIEOutputFormat, and HIEStorage will be included in the Howl
client jar.
 1. As Hive by definition only works on tables that are known to the metastore, there is no
HIESerDe nor any requirement for these data collections to work with Hive outside of Howl.
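
For illustration, a data collection meeting the single-directory and underscore-metadata requirements above might be laid out as follows (all names hypothetical):

{{{
/user/alice/clicks_20101109/   <- the whole collection: one directory, movable
                                  with distcp, copyToLocal, or copyFromLocal
    _metadata                  <- schema, storage format, sort order, table name,
                                  partition key values; ignored by Hadoop jobs
    part-00000                 <- data files in the chosen storage format
    part-00001
}}}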

== Design ==

=== Interface ===

==== HIEInputFormat ====
TBD - should look as much like !HowlInputFormat as possible.

==== HIEOutputFormat ====
TBD - should look as much like !HowlOutputFormat as possible.

==== HIELoader ====
TBD - should look as much like !HowlLoader as possible.

==== HIEStorage ====
TBD - should look as much like !HowlStorage as possible.

==== Changes to Howl CLI ====
New `IMPORT` syntax - TBD

New `EXPORT` syntax - TBD

=== Implementation ===
Re-use as much of !HowlInputFormat, !HowlOutputFormat, !HowlLoader, and !HowlStorage as possible, including storage drivers.  Hopefully this can be done by giving each Howl/HIE pair a common superclass, with only the metadata access differing between the two.
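
A hedged sketch of that idea for the input side, with all names illustrative: an abstract base class owns the shared read machinery and defers only the metadata lookup to its two subclasses.

{{{
import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;

// Illustrative sketch only; none of these names are a committed design.
// Holds whatever the shared read path needs: schema, storage format,
// sort order, and (if present) table and partition information.
class CollectionMetadata { }

abstract class HowlBaseInputFormat {
  // The single point of divergence between the two variants.
  protected abstract CollectionMetadata getMetadata(JobContext context)
      throws IOException;

  // Shared machinery (split generation, storage driver selection, ...)
  // would live here, driven by whatever getMetadata() returns.
}

// Metastore-backed variant, i.e. what HowlInputFormat could extend.
class MetastoreInputFormat extends HowlBaseInputFormat {
  @Override
  protected CollectionMetadata getMetadata(JobContext context) throws IOException {
    // sketch: look the table up in the Howl metastore
    throw new UnsupportedOperationException("illustrative stub");
  }
}

// Metastore-less variant, i.e. what HIEInputFormat could extend.
class UnderscoreFileInputFormat extends HowlBaseInputFormat {
  @Override
  protected CollectionMetadata getMetadata(JobContext context) throws IOException {
    // sketch: parse the underscore metadata file in the collection directory
    throw new UnsupportedOperationException("illustrative stub");
  }
}
}}}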

All metadata should be stored in an underscore file in the same directory as the data.

== Open Questions ==
 1. Originally we talked about this converting formats on import (if they didn't match the table's default) and export (if the user requested it).  I have not included that in this design.  At the very least I'd like to put it off until a later version.  It's not 100% clear to me that we should support it at all, as it requires Howl to start independent MR jobs, which we may not want.

== Comments ==

=== Olga ===

 * I think we should consider compressing the data we are exporting unless it is already compressed, as might be the case with !RCFile data.
 * I would imagine that the most common case of import/export (other than moving to a different grid) would be in the text (!PigStorage) format.  Since Howl v1 would support this format, making this work would not be an issue.  However, I am wondering if storing it as text within Howl is the right approach, given that it might not be the most efficient way.  I think we need to understand this use case a bit better.  This is related to Open Question #1.
