hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/Design" by JeffHammerbacher
Date Thu, 22 Jan 2009 04:11:34 GMT

The following page has been changed by JeffHammerbacher:

   * Tables - These are analogous to tables in relational databases. Tables can be filtered,
projected, joined and unioned. All the data of a table is stored in a directory in HDFS.
Hive also supports the notion of external tables, wherein a table can be created on preexisting
files or directories in HDFS by providing the appropriate location to the table creation DDL.
The rows in a table are organized into typed columns, as in relational databases.
   * Partitions - Each table can have one or more partition keys, which determine how the data
is stored, e.g. a table T with a date partition column ds has the files with data for a particular
date stored in the <table location>/ds=<date> directory in HDFS. Partitions allow
the system to prune the data to be inspected based on query predicates, e.g. a query that is interested
in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files
in the <table location>/ds=2008-09-01/ directory in HDFS.
   * Buckets - Data in each partition may in turn be divided into buckets based on the hash
of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing
allows the system to efficiently evaluate queries that depend on a sample of the data (these are
queries that use the SAMPLE clause on the table).
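
The partition layout and the bucketing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the /warehouse/T location, the function names and the hash function are hypothetical and are not taken from Hive itself.

```python
from typing import List


def partition_dir(table_location: str, key: str, value: str) -> str:
    """Directory for one partition, following <table location>/<key>=<value>."""
    return f"{table_location.rstrip('/')}/{key}={value}"


def prune_partitions(table_location: str, key: str,
                     available: List[str], wanted: str) -> List[str]:
    """Keep only the partition directories that can satisfy key = wanted."""
    return [partition_dir(table_location, key, v)
            for v in available if v == wanted]


def bucket_for(value: str, num_buckets: int) -> int:
    """Map a column value to a bucket by hashing it.

    The hash here is an illustrative polynomial hash, not Hive's own.
    """
    h = 0
    for byte in value.encode("utf-8"):
        h = (31 * h + byte) & 0x7FFFFFFF
    return h % num_buckets


# Hypothetical table location and partition values.
dirs = prune_partitions("/warehouse/T", "ds",
                        ["2008-08-31", "2008-09-01", "2008-09-02"],
                        "2008-09-01")
print(dirs)
print(bucket_for("user_42", 32))
```

With the predicate T.ds = '2008-09-01', only one of the three hypothetical partition directories survives pruning. Reading a single bucket file out of N is one way a sampling query could avoid scanning the whole partition.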
  Apart from primitive column types (integers, floating point numbers, generic strings, dates
and booleans), Hive also supports arrays and maps. Additionally, users can compose their own
types programmatically from any of the primitives, collections or other user-defined types.
The typing system is closely tied to the serde (Serialization/Deserialization) and object inspector
interfaces. Users can create their own types by implementing their own object inspectors, and
using these object inspectors they can create their own serdes to serialize and deserialize
their data to and from HDFS files. These two interfaces provide the necessary hooks to extend the
capabilities of Hive when it comes to understanding other data formats and richer types. Built-in
object inspectors like ListObjectInspector, StructObjectInspector and MapObjectInspector provide
the necessary primitives to compose richer types in an extensible manner. For maps (associative
arrays) and arrays, useful built-in functions like size and index operators are provided.
The dotted notation is used to navigate nested types,
e.g. a.b.c = 1 looks at field c of field b of a and compares that with 1.
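
As a rough illustration of the dotted notation and of the size and index operators, the Python sketch below models structs and maps as dicts and arrays as lists. The row contents, the column names tags and props, and the navigate helper are hypothetical; this is not how Hive's object inspectors are implemented.

```python
from typing import Any


def navigate(value: Any, dotted_path: str) -> Any:
    """Follow a dotted path such as "a.b.c" through nested dict-based structs."""
    for field in dotted_path.split("."):
        value = value[field]
    return value


# Hypothetical row with a nested struct column, an array column and a map column.
row = {
    "a": {"b": {"c": 1}},
    "tags": ["x", "y", "z"],
    "props": {"color": "red"},
}

matches = navigate(row, "a.b.c") == 1   # the predicate a.b.c = 1
array_size = len(row["tags"])           # size of an array
first_tag = row["tags"][0]              # index operator on an array
color = row["props"]["color"]           # index operator on a map
```

Each step of the dotted path resolves one field of the enclosing value, which mirrors how a.b.c reads field b of a and then field c of that result.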
