hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/Design" by RaghothamMurthy
Date Thu, 22 Jan 2009 01:41:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by RaghothamMurthy:

+ [[TableOfContents]]
  == Hive Architecture ==
  Figure \ref{fig:sys_arch} shows the major components of Hive and its interactions with Hadoop.
As shown in that figure, the main components of Hive are: 
   * UI - The user interface for users to submit queries and other operations to the system.
Currently the system has a command line interface, and a web-based GUI is being developed.
@@ -20, +21 @@

  Apart from primitive column types (integers, floating point numbers, generic strings, dates
and booleans), Hive also supports arrays and maps. Additionally, users can compose their own
types programmatically from any of the primitives, collections or other user-defined types.
The typing system is closely tied to the SerDe (Serialization/Deserialization) and object inspector
interfaces. Users can create their own types by implementing their own object inspectors, and
using these object inspectors they can create their own SerDes to serialize and deserialize
their data into HDFS files. These two interfaces provide the necessary hooks to extend the
capabilities of Hive when it comes to understanding other data formats and richer types. Built-in
object inspectors like ListObjectInspector, StructObjectInspector and MapObjectInspector provide
the necessary primitives to compose richer types in an extensible manner. For maps (associative
arrays) and arrays, useful built-in functions such as
 size and index operators are provided. The dotted notation is used to navigate nested types,
e.g. a.b.c = 1 looks at field c of field b of type a and compares that with 1.
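
The dotted navigation and the size and index operators described above can be illustrated with a minimal Python sketch. This is not Hive's object inspector API: the helper `get_path` and the sample row are hypothetical, with nested dicts standing in for structs and maps, and a list standing in for an array.

```python
from typing import Any


def get_path(value: Any, path: str) -> Any:
    """Follow a dotted path such as "a.b.c" through nested dicts (structs)."""
    for field_name in path.split("."):
        value = value[field_name]
    return value


# Hypothetical row: a struct column `a`, an array column `tags`, a map column `props`.
row = {
    "a": {"b": {"c": 1}},
    "tags": ["x", "y", "z"],
    "props": {"k1": "v1"},
}

matches = get_path(row, "a.b.c") == 1  # analogous to the predicate a.b.c = 1
tag_count = len(row["tags"])           # analogous to a size function on an array
second_tag = row["tags"][1]            # analogous to an index operator on an array
prop_value = row["props"]["k1"]        # analogous to an index operator on a map
```

Each step of the path selects one field of the enclosing value, which is the same idea as composing struct, list and map inspectors to reach a nested field.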
- == Meta Store ==
+ == Metastore ==
  === Motivation ===
  The metastore provides two important but often overlooked features of a data warehouse:
data abstraction and data discovery. Without the data abstractions provided in Hive, a user
has to provide information about data formats, extractors and loaders along with the query.
In Hive, this information is given during table creation and reused every time the table is referenced.
This is very similar to traditional warehousing systems. The second functionality, data
discovery, enables users to discover and explore relevant and specific data in the warehouse.
Other tools can be built using this metadata to expose and possibly enhance the information
about the data and its availability. Hive accomplishes both of these features by providing
a metadata repository that is tightly integrated with the Hive query processing system so that
data and metadata are in sync.
@@ -29, +30 @@

   * Table - Metadata for a table contains the list of columns, owner, storage and SerDe information.
It can also contain any user-supplied key and value data. Storage information includes the location
of the underlying data, file input and output formats and bucketing information. SerDe metadata
includes the implementation class of the serializer and deserializer and any supporting information
required by the implementation. All of this information can be provided during the creation
of the table.
   * Partition - Each partition can have its own columns and SerDe and storage information.
This facilitates schema changes without affecting older partitions.
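
The table and partition metadata listed above can be pictured with a small Python sketch. All class names, field names and format strings here are hypothetical and are not the metastore's actual object model. The sketch also assumes that a partition takes a snapshot of the table's schema when it is created, as one way to show why later schema changes leave older partitions untouched.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Storage:
    location: str
    input_format: str
    output_format: str
    num_buckets: int = 0


@dataclass
class SerDe:
    implementation_class: str
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Table:
    name: str
    owner: str
    columns: List[str]
    storage: Storage
    serde: SerDe
    properties: Dict[str, str] = field(default_factory=dict)  # user-supplied key/value data


@dataclass
class Partition:
    table_name: str
    spec: str
    columns: List[str]
    storage: Storage
    serde: SerDe


def add_partition(table: Table, spec: str) -> Partition:
    """Copy the table's current columns, SerDe and storage into a new partition (assumed behaviour)."""
    storage = copy.deepcopy(table.storage)
    storage.location = f"{table.storage.location}/{spec}"
    return Partition(table.name, spec, list(table.columns), storage, copy.deepcopy(table.serde))


# Placeholder format and SerDe names, for illustration only.
table = Table(
    name="page_views",
    owner="alice",
    columns=["user_id", "url"],
    storage=Storage("/warehouse/page_views", "ExampleInputFormat", "ExampleOutputFormat"),
    serde=SerDe("ExampleSerDe"),
)
old_partition = add_partition(table, "ds=2009-01-01")
table.columns.append("referrer")  # schema change on the table
new_partition = add_partition(table, "ds=2009-01-02")
```

Because each partition carries its own copy of the columns, `old_partition` keeps the two-column schema while `new_partition` sees the added column.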
- === Meta Store Architecture ===
+ === Metastore Architecture ===
  Metastore is an object store with a database or file backed store. The database backed store
is implemented using an ORM solution\cite{jpox}. The prime motivation for storing this in a
relational database is queryability of metadata. Some disadvantages of using a separate
data store for metadata instead of using HDFS are synchronization and scalability issues. Additionally
there is no clear way to implement an object store on top of HDFS due to the lack of random updates
to files. This, coupled with the advantages of queryability of a relational store, made our
approach a sensible one.
  The metastore can be configured to be used in a couple of ways: remote and embedded. In remote
mode, the metastore is a Thrift\cite{thrift} service. This mode is useful for non-Java clients.
In embedded mode, the Hive client directly connects to the underlying metastore using JDBC. This
mode is useful because it avoids another system that needs to be maintained and monitored.
Both of these modes can co-exist.
+ === Metastore Interface ===
+ The metastore provides a Thrift interface\cite{msapi} to manipulate and query Hive metadata. Thrift
provides bindings in many popular languages. Third party tools can use this interface to integrate
Hive metadata into other business metadata repositories.
+ == Hive Query Language ==
+ HiveQL is an SQL-like query language for Hive. It mostly mimics SQL syntax for creation
+ of tables, loading data into tables and querying the tables. HiveQL also allows
+ users to embed their custom map-reduce scripts. These scripts can be written in any language
+ using a simple row-based streaming interface -- read rows from standard input and write
+ rows to standard output. This flexibility comes at a cost of a performance hit caused by
+ converting rows from and to strings. However, we have seen that users do not mind this given
+ that they can implement their scripts in the language of their choice. Another feature
+ unique to HiveQL is multi-table insert. In this construct, users can perform multiple queries
+ on the same input data using a single HiveQL query. Hive optimizes these queries to share
+ the scan of the input data, thus increasing the throughput of these queries by several orders
+ of magnitude. We omit more details due to lack of space. For a more complete
+ description of the HiveQL language see the [wiki:Self:Hive/LanguageManual language manual].
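
As an illustration of the row-based streaming interface described above, here is a minimal Python sketch of a custom script that reads rows from standard input and writes rows to standard output. The tab-separated layout, the two-column shape (`user_id`, `url`) and the lower-casing step are assumptions made for this example only.

```python
import sys
from typing import Iterable, Iterator, TextIO


def transform(lines: Iterable[str]) -> Iterator[str]:
    """Turn each input row (a string) into an output row (a string)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")  # assumed: tab-separated columns
        if len(fields) < 2:
            continue  # skip rows that do not have both assumed columns
        user_id, url = fields[0], fields[1]
        yield f"{user_id}\t{url.lower()}"


def main(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Stream rows from stdin to stdout, one output line per kept input line."""
    for out_row in transform(stdin):
        stdout.write(out_row + "\n")
```

Calling `main()` at the bottom of the file (for example under the usual `if __name__ == "__main__":` guard) turns this into a standalone script. Every row crosses the process boundary as text, which is the string conversion cost mentioned above.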
