hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang
Date Thu, 24 Dec 2009 01:02:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/Tutorial" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=16&rev2=17

--------------------------------------------------

  = Concepts =
  
  == What is Hive ==
- Hive is the next generation infrastructure made with the goal of providing tools to enable
easy data summarization, adhoc querying and analysis of detail data. In addition it also provides
a simple query language called QL which is based on SQL and which enables users familiar with
SQL to do ad-hoc querying, summarization and data analysis. At the same time, this language
also allows traditional map/reduce programmers to be able to plug in their custom mappers
and reducers to do more sophisticated analysis which may not be supported by the built in
capabilities of the language. 
+ Hive is the next generation infrastructure designed with the goal of providing data processing
systems to enable easy data summarization, ad hoc querying and analysis of large volumes of
data. It also provides a simple query language called QL, which is based on SQL and which
enables users familiar with SQL to do ad hoc querying, summarization and data analysis easily.
At the same time, Hive QL also allows traditional map/reduce programmers to plug in their
custom mappers and reducers to do more sophisticated analysis that may not be supported by
the built-in capabilities of the language. 
  
  == What is NOT Hive ==
- Hive is based on Hadoop, which is a batch processing system. Accordingly, this system does
not and cannot promise low latencies on queries. The paradigm here is strictly of submitting
jobs and being notified when the jobs are completed as opposed to real time queries. As a
result it should not be compared with systems like Oracle where analysis is done on a significantly
smaller amount of data but the analysis proceeds much more iteratively with the response times
between iterations being less than a few minutes. For Hive queries response times for even
the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run
into hours.
+ Hive is based on Hadoop, which is a batch processing system. Accordingly, this system does
not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting
jobs and being notified when the jobs are completed, as opposed to real-time queries. As a
result it should not be compared with systems such as Oracle, where analyses are conducted
on a significantly smaller amount of data but the analyses proceed much more iteratively, with
the response times between iterations being less than a few minutes. A typical Hive query's
response time is usually greater than a couple of minutes, and large jobs may even run
into hours. What Hive provides is fault tolerance and a scale-out option, where more commodity
boxes can be added to the Hadoop cluster as the data size and/or workload increases, and Hive
will automatically benefit from that. 
  
- In the following sections we provide a tutorial on the capabilities of the system. We start
by describing the concepts of data types, tables and partitions (which are very similar to
what you would find in a traditional relational database) and then illustrate the capabilities
of the language with the help of some examples
+ In the following sections we provide a tutorial on the capabilities of the system. We start
by describing the concepts of data types, tables and partitions (which are very similar to
what you would find in a traditional relational DBMS) and then illustrate the capabilities
of the QL language with the help of some examples.
  
  == Data Units ==
- In order of granularity - Hive data is organized into:
+ In order of granularity, Hive data is organized into:
- 
+     * Databases: Namespaces that separate tables and other data units to avoid naming conflicts.

-     * Tables: Homogeneous units of data which have the same schema. An example of a table
could be page_views table. Each row of this table could comprise of the following columns,
which define the schema of the table:
+     * Tables: Homogeneous units of data which have the same schema. An example of a table
could be the page_views table, where each row could comprise the following columns (schema):
-           * timestamp - which is an int that corresponds to a unix timestamp of when the
page was viewed.
+           * timestamp - which is of INT type that corresponds to a unix timestamp of when
the page was viewed.
-           * userid - which is a bigint (see Primitive Types) identifying the user who viewed
the page.
+           * userid - which is of BIGINT type that identifies the user who viewed the page.
-           * page_url - which is a string that captures the location of the page.
+           * page_url - which is of STRING type that captures the location of the page.
-           * referer_url - which is a string that captures the location of the page from
where we arrived at the current page.
+           * referer_url - which is of STRING type that captures the location of the page from
where the user arrived at the current page.
-           * IP - which is a string that captures the IP address from where the page request
was made.
+           * IP - which is of STRING type that captures the IP address from where the page
request was made.
-     * Partitions: Each Table can have one or more partition Keys which determines how the
data is stored. Partitions - apart from being storage units - also allow the user to efficiently
identify the rows that satisfy a certain criteria. e.g a date_partition of type Datestamp
and country_partition of type String. Each unique specification of the partition keys defines
a partition of the Table e.g. all "US" data from "2008-02-02" is a partition of the page_views
table. Therefore, if you have to run analysis on only the "US" data for 2008-02-02, you can
run that analysis only on the relevant partition of the table thereby speeding up the analysis
significantly. (Note, however, that just because a partition is named 2008-02-02 does not
mean that it contains all or only data from that date; partitions are named after dates for
convenience but it is the user's job to guarantee the relationship between partition name
and data content!). Partition columns are virtual columns, they are not part of the data itself
but are derived on load.
+     * Partitions: Each Table can have one or more partition Keys which determine how the
data is stored. Partitions - apart from being storage units - also allow the user to efficiently
identify the rows that satisfy certain criteria. For example, a date_partition of type STRING
and country_partition of type STRING. Each unique value of the partition keys defines a partition
of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views
table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that
query only on the relevant partition of the table, thereby speeding up the analysis significantly.
(Note, however, that just because a partition is named 2009-12-23 does not mean that it contains
all or only data from that date; partitions are named after dates for convenience but it is
the user's job to guarantee the relationship between partition name and data content!) Partition
columns are virtual columns; they are not part of the data itself but are derived on load.
-     * Buckets (Cluster) : Data in each partition may in turn be divided into Buckets based
on a hash of some column of the Table. For example the page_views table may be bucketed by
userid (which is one of the columns of the page_view table, unlike partitions column). These
can be used to efficiently sample the data.
+     * Buckets (or Clusters): Data in each partition may in turn be divided into Buckets
based on the value of a hash function of some column of the Table. For example, the page_views
table may be bucketed by userid, which is one of the columns (other than the partition columns)
of the page_views table. These can be used to efficiently sample the data.
  
  Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions
allow the system to prune large quantities of data during query processing, resulting in faster
query execution. 
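 The partitioning and bucketing described above might be declared as follows. This is a hedged sketch in Hive QL for the page_views table of the earlier example; the exact column list, partition names (dt, country) and bucket count are illustrative assumptions, not part of the original text:

```sql
-- Hypothetical DDL for the page_views table described above:
-- partitioned by date and country, bucketed by userid into 32 buckets.
CREATE TABLE page_views(viewTime INT, userid BIGINT,
                        page_url STRING, referer_url STRING,
                        ip STRING COMMENT 'IP address of the viewer')
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(userid) INTO 32 BUCKETS;
```

A query restricted to one partition (e.g. dt = '2009-12-23' AND country = 'US') then only reads that partition's data, which is the pruning benefit mentioned above.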
  
@@ -36, +36 @@

          * SMALLINT - 2 byte integer
          * INT - 4 byte integer
          * BIGINT - 8 byte integer
+     * Boolean type
+         * BOOLEAN - TRUE/FALSE
      * Floating point numbers
+         * FLOAT - single precision 
          * DOUBLE - Double precision
      * String type
          * STRING - sequence of characters in a specified character set
@@ -45, +48 @@

  
      Type 
         |→Primitive Type 
-                  |→Numbers 
+                  |→Number 
-                          |→Double 
+                          |→DOUBLE 
                                 |→BIGINT 
                                        |→INT 
-                                             |→TINYINT 
+                                             |→TINYINT
+                                |→FLOAT 
-                  |→Strings 
+                  |→STRING 
+                  |→BOOLEAN 
         |→Complex Type 
- This type hierarchy defines how the types are implicitly converted in the query language.
Implicit conversion is allowed for types from child to an ancestor. So when a query expression
expects type1 and the data is of type2 type2 is implicitly converted to type1 if type1 is
an ancestor of type2 in the type hierarchy. Apart from this fundamental rule for implicit
conversion, implicit conversion is also allowed for the following cases:
+ This type hierarchy defines how the types are implicitly converted in the query language.
Implicit conversion is allowed for types from a child to an ancestor. So when a query expression
expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is
an ancestor of type2 in the type hierarchy. Apart from these fundamental rules for implicit
conversion based on the type system, Hive also allows a special case of conversion:
  
-     * String to Double 
+     * STRING → DOUBLE
  
- Explicit type conversion can be done using the cast operator as shown in the Table of '''
Built in Functions ''' section below.
+ Explicit type conversion can be done using the cast operator as shown in the [[Built in
Functions|Built in Functions]] section below.
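 Both conversion paths can be sketched in a query. The table and column names below are hypothetical and only illustrate the two mechanisms described above:

```sql
-- Implicit: the STRING column price_str is converted to DOUBLE
-- when it appears in an arithmetic context (STRING -> DOUBLE).
SELECT price_str + 0.0 FROM products;

-- Explicit: the cast operator converts between types directly.
SELECT CAST(price_str AS DOUBLE) FROM products;
```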
  
  === Complex Types ===
  Complex Types can be built up from primitive types and other composite types using:
  
-     * Structs: the elements within the type can be accessed using the . notation e.g. for
a column c of type struct {a int; b int} the a field is accessed by the expression a.c
+     * Structs: the elements within the type can be accessed using the DOT (.) notation.
For example, for a column c of type STRUCT {a INT; b INT}, the field a is accessed by the expression
c.a
-     * Maps (key-value tuples): The elements are accessed using ['element name'] notation
e.g. in a map M comprising of a mapping from 'group' -> gid the gid value can be accessed
using M['group']
+     * Maps (key-value tuples): The elements are accessed using ['element name'] notation.
For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be
accessed using M['group']
-     * Arrays (indexable lists): The elements are accessed using the [n] notation where n
is an index into the array e.g. for an array A having the elements ['a', 'b', 'c'], A[1] retruns
'b'. The index starts from 0. 
+     * Arrays (indexable lists): The elements in the array have to be of the same type. Elements
can be accessed using the [n] notation, where n is a zero-based index into the array. For
example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'. 
  
- Using the primitive types and the constructs for creating complex types, types with arbitrary
levels of nesting can be created.
+ Using the primitive types and the constructs for creating complex types, types with arbitrary
levels of nesting can be created. For example, a type User may comprise the following fields:
  
- e.g. a type User may comprise of the following fields:
- 
-     * id - which is a 4 byte integer.
+     * id - which is a 4-byte INT.
-     * name - which is a string.
+     * name - which is a STRING.
-     * age - which is an integer.
+     * age - which is a 1-byte TINYINT.
-     * weight - which is a floating point number.
+     * weight - which is a FLOAT number.
-     * friends - which is a array of ids(integers).
+     * friends - which is an array of ids (BIGINT).
-     * gender - which is an integer.
+     * gender - which is a STRING.
-     * active - which is a boolean. 
+     * active - which is a BOOLEAN. 
- 
- '''The tables with columns that are an instance of a complex type can only be created programmatically
and NOT through hive command line at this time'''. We will be adding ability to add such tables
through the hive command line in the future. 
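 The three access notations above can be combined in a single query. The following is a sketch; the table t and its columns c, m and arr are hypothetical, assuming a STRUCT, a MAP and an ARRAY column as described above:

```sql
-- c is a STRUCT {a INT; b INT}, m is a MAP<STRING, INT>,
-- and arr is an ARRAY<STRING> in the hypothetical table t.
SELECT c.a,            -- struct field access with DOT notation
       m['group'],     -- map lookup by key
       arr[1]          -- zero-based array indexing
FROM t;
```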
  
  == Built in operators and functions ==
  === Built in operators ===
      *Relational Operators - The following operators compare the passed operands and generate
a TRUE or FALSE value depending on whether the comparison between the operands holds or not.

   '''Relational Operators'''
- || Operator || Operand types ||  Description ||
+ || '''Operator''' || '''Operand types''' ||  '''Description''' ||
- || A = B || all primitive types || TRUE if expression A is equal to expression B otherwise
FALSE||
+ || A = B || all primitive types || TRUE if expression A is equal to expression B otherwise
FALSE. Note: SQL uses =, not ==.||
- || A == B || none! || Fails; SQL uses = and not ==!||
  || A <> B || all primitive types || TRUE if expression A is NOT equal to expression
B otherwise FALSE||
  || A < B || all primitive types || TRUE if expression A is  less than expression B otherwise
FALSE||
  || A <= B || all primitive types || TRUE if expression A is less than or equal to expression
B otherwise FALSE||
  || A > B || all primitive types || TRUE if expression A is greater than expression B
otherwise FALSE||
  || A >= B || all primitive types || TRUE if expression A is greater than or equal to
expression B otherwise FALSE||
  || A IS NULL || all types || TRUE if expression A evaluates to NULL otherwise FALSE||
- || A IS NOT NULL || all types || TRUE if expression A evaluates to NULL otherwise FALSE||
+ || A IS NOT NULL || all types || FALSE if expression A evaluates to NULL otherwise TRUE||
- || A LIKE B || strings || TRUE if string A matches the SQL simple regular expression B,
otherwise FALSE. The comparison is done character by character. The _ character in B matches
any character in A(similar to . in posix regular expressions) while the % character in B matches
an arbitrary number of characters in A(similar to .* in posix regular expressions) e.g. 'foobar'
like 'foo' evaluates to FALSE where as 'foobar' like {{{'foo___'}}} evaluates to TRUE and
so does 'foobar' like 'foo%'||
+ || A LIKE B || strings || TRUE if string A matches the SQL simple regular expression B,
otherwise FALSE. The comparison is done character by character. The _ character in B matches
any character in A (similar to . in posix regular expressions), and the % character in B matches
an arbitrary number of characters in A (similar to .* in posix regular expressions). For example,
'foobar' like 'foo' evaluates to FALSE where as 'foobar' like {{{'foo___'}}} evaluates to
TRUE and so does 'foobar' like 'foo%'||
- || NOT A LIKE B || strings || TRUE if string A not matches the SQL simple regular expression
B, otherwise FALSE||
- || A RLIKE B || strings ||  TRUE if string A matches the Java regular expression B(See [[http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html|Java
regular expressions syntax]]), otherwise FALSE e.g. 'foobar' rlike 'foo' evaluates to FALSE
where as 'foobar' rlike '^f.*r$' evaluates to TRUE||
+ || A RLIKE B || strings ||  TRUE if string A matches the Java regular expression B (See
[[http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html|Java regular expressions
syntax]]), otherwise FALSE. For example, 'foobar' rlike 'foo' evaluates to FALSE whereas 'foobar'
rlike '^f.*r$' evaluates to TRUE||
  || A REGEXP B || strings || Same as RLIKE ||
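 The string-matching operators above are typically used in a WHERE clause. A sketch against the page_views table of the earlier example (the literal patterns are illustrative):

```sql
-- LIKE: _ matches a single character, % matches any number of characters.
SELECT page_url FROM page_views
WHERE page_url LIKE '%example.com%';

-- RLIKE/REGEXP: full Java regular expression syntax.
SELECT page_url FROM page_views
WHERE referer_url RLIKE '^https?://.*';
```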
   
      *Arithmetic Operators - The following operators support various common arithmetic operations
on the operands. All of them return number types. 
