#pragma section-numbers 1

= Concepts =

== What is Hive ==
Hive is a next-generation data processing infrastructure designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called QL, which is based on SQL and enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

== What is NOT Hive ==
Hive is based on Hadoop, which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when the jobs are completed, as opposed to real-time queries. As a result it should not be compared with systems such as Oracle, where analyses are conducted on a significantly smaller amount of data but proceed much more iteratively, with the response times between iterations being less than a few minutes. A typical Hive query's response time is usually greater than a couple of minutes, and large jobs may even run into hours. What Hive provides is a fault-tolerant, scale-out option: more commodity boxes can be added to the Hadoop cluster as the data size and/or workload increases, and Hive will automatically benefit from that.

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of the QL language with the help of some examples.

== Data Units ==
In order of granularity, Hive data is organized into:

 * Databases: Namespaces that separate tables and other data units from naming conflicts.
 * Tables: Homogeneous units of data which have the same schema. An example of a table could be the page_views table, where each row could comprise of the following columns (schema):
  * timestamp - which is of INT type that corresponds to a Unix time stamp of when the page was viewed.
  * userid - which is of BIGINT type that identifies the user who viewed the page.
  * page_url - which is of STRING type that captures the location of the page.
  * referer_url - which is of STRING type that captures the location of the page from where the user arrived at the current page.
  * IP - which is of STRING type that captures the IP address from where the page request was made.
 * Partitions: Each Table can have one or more partition keys which determine how the data is stored. Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy certain criteria. For example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly (see the example query after this list). Note however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience but it is the user's job to guarantee the relationship between partition name and data content! Partition columns are virtual columns; they are not part of the data itself but are derived on load.
 * Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. These can be used to efficiently sample the data.

Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.
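
For instance, assuming page_views is partitioned on columns named dt and country (the names used in the CREATE TABLE examples later on this page), a partition-pruned query could look like the following sketch; the where clause on the partition columns lets Hive scan only the single matching partition:

{{{
    SELECT page_views.*
    FROM page_views
    WHERE page_views.dt = '2009-12-23' AND page_views.country = 'US';
}}}
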
== Type System ==

=== Primitive Types ===
Types are associated with the columns in the tables. The following primitive types are supported:

 * Integers
  * TINYINT - 1 byte integer
  * SMALLINT - 2 byte integer
  * INT - 4 byte integer
  * BIGINT - 8 byte integer
 * Boolean type
  * BOOLEAN - TRUE/FALSE
 * Floating point numbers
  * FLOAT - single precision
  * DOUBLE - double precision
 * String type
  * STRING - sequence of characters in a specified character set

=== Type Conversion ===
The types are organized in the following hierarchy (where the parent is a super type of all the children instances):

 Type
  |→ Primitive Type
   |→ Number
    |→ DOUBLE
     |→ BIGINT
      |→ INT
       |→ TINYINT
     |→ FLOAT
   |→ STRING
   |→ BOOLEAN
  |→ Complex Type

This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Apart from these fundamental rules for implicit conversion based on the type system, Hive also allows one special case of conversion:

 * STRING → DOUBLE

Explicit type conversion can be done using the cast operator as shown in the [[#Built_in_functions|Built in Functions]] section below.
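
As a small sketch of both conversion paths (assuming a hypothetical table t with a STRING column s that holds numeric text):

{{{
    -- s is implicitly promoted to DOUBLE for the comparison (the STRING → DOUBLE special case);
    -- cast() performs the conversion to BIGINT explicitly.
    SELECT cast(t.s AS BIGINT)
    FROM t
    WHERE t.s > 0.5;
}}}
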
=== Complex Types ===
Complex Types can be built up from primitive types and other composite types using:

 * Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
 * Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example, in a map M comprising of a mapping from 'group' -> gid, the gid value can be accessed using M['group'].
 * Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise of the following fields:

 * id - which is a 4-byte INT.
 * name - which is a STRING.
 * age - which is a 1-byte TINYINT.
 * weight - which is a FLOAT number.
 * friends - which is an array of ids (BIGINT).
 * gender - which is a STRING.
 * active - which is a BOOLEAN.
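
These access notations compose naturally. A brief sketch (assuming a hypothetical table users whose rows have the User fields above, plus a properties MAP column added here purely for illustration):

{{{
    SELECT u.name,                 -- plain column
           u.friends[0],           -- first element of the friends array
           u.properties['group']   -- map lookup by key
    FROM users u;
}}}
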
== Built in operators and functions ==

=== Built in operators ===
 * Relational Operators - The following operators compare the passed operands and generate a TRUE or FALSE value depending on whether the comparison between the operands holds or not (see the example after the table).

'''Relational Operators'''
||'''Operator''' ||'''Operand types''' ||'''Description''' ||
||A = B ||all primitive types ||TRUE if expression A is equal to expression B, otherwise FALSE. Note: SQL uses =, not == ||
||A <> B ||all primitive types ||TRUE if expression A is NOT equal to expression B, otherwise FALSE ||
||A < B ||all primitive types ||TRUE if expression A is less than expression B, otherwise FALSE ||
||A <= B ||all primitive types ||TRUE if expression A is less than or equal to expression B, otherwise FALSE ||
||A > B ||all primitive types ||TRUE if expression A is greater than expression B, otherwise FALSE ||
||A >= B ||all primitive types ||TRUE if expression A is greater than or equal to expression B, otherwise FALSE ||
||A IS NULL ||all types ||TRUE if expression A evaluates to NULL, otherwise FALSE ||
||A IS NOT NULL ||all types ||FALSE if expression A evaluates to NULL, otherwise TRUE ||
||A LIKE B ||strings ||TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in posix regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in posix regular expressions). For example, 'foobar' like 'foo' evaluates to FALSE whereas 'foobar' like {{{'foo___'}}} evaluates to TRUE and so does 'foobar' like 'foo%' ||
||A RLIKE B ||strings ||TRUE if string A matches the Java regular expression B (see [[http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html|Java regular expressions syntax]]), otherwise FALSE. For example, 'foobar' rlike 'foo' evaluates to FALSE whereas 'foobar' rlike '^f.*r$' evaluates to TRUE ||
||A REGEXP B ||strings ||Same as RLIKE ||
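
A short sketch of LIKE and RLIKE in a query (assuming the page_view table described earlier):

{{{
    SELECT pv.page_url
    FROM page_view pv
    WHERE pv.referrer_url LIKE '%xyz.com'        -- SQL simple pattern: % matches any prefix
       OR pv.referrer_url RLIKE '^https?://';    -- Java regular expression
}}}
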
 * Arithmetic Operators - The following operators support various common arithmetic operations on the operands. All of them return number types.

'''Arithmetic Operators'''
||'''Operator''' ||'''Operand types''' ||'''Description''' ||
||A + B ||all number types ||Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands, e.g. since float is a containing type of integer, the + operator on a float and an int will result in a float. ||
||A - B ||all number types ||Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. ||
||A * B ||all number types ||Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes an overflow, you will have to cast one of the operands to a type higher in the type hierarchy (see the sketch after this table). ||
||A / B ||all number types ||Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division. ||
||A % B ||all number types ||Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. ||
||A & B ||all number types ||Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. ||
||A | B ||all number types ||Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. ||
||A ^ B ||all number types ||Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. ||
||~A ||all number types ||Gives the result of bitwise NOT of A. The type of the result is the same as the type of A. ||
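
A minimal sketch of working around integer overflow with an explicit cast (assuming a hypothetical table t with INT columns a and b whose product may not fit in an INT):

{{{
    -- Casting one operand to BIGINT promotes the whole multiplication
    -- to the wider type, per the type hierarchy above.
    SELECT cast(t.a AS BIGINT) * t.b
    FROM t;
}}}
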
 * Logical Operators - The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.

'''Logical Operators'''
||'''Operator''' ||'''Operand types''' ||'''Description''' ||
||A AND B ||boolean ||TRUE if both A and B are TRUE, otherwise FALSE ||
||A && B ||boolean ||Same as A AND B ||
||A OR B ||boolean ||TRUE if either A or B or both are TRUE, otherwise FALSE ||
||{{{ A || B }}} ||boolean ||Same as A OR B ||
||NOT A ||boolean ||TRUE if A is FALSE, otherwise FALSE ||
||!A ||boolean ||Same as NOT A ||

 * Operators on Complex Types - The following operators provide mechanisms to access elements in Complex Types.

'''Operators on Complex Types'''
||'''Operator''' ||'''Operand types''' ||'''Description''' ||
||A[n] ||A is an Array and n is an int ||returns the nth element in the array A. The first element has index 0, e.g. if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar' ||
||M[key] ||M is a Map and key has type K ||returns the value corresponding to the key in the map, e.g. if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar' ||
||S.x ||S is a struct ||returns the x field of S, e.g. for struct foobar {int foo, int bar} foobar.foo returns the integer stored in the foo field of the struct. ||

=== Built in functions ===
The following built in functions are supported in Hive ([[http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java?view=markup|list of functions in source code: FunctionRegistry.java]]):

'''Built in Functions'''
||'''Return Type''' ||'''Name(Signature)''' ||'''Description''' ||
||BIGINT ||round(double a) ||returns the rounded BIGINT value of the double ||
||BIGINT ||floor(double a) ||returns the maximum BIGINT value that is equal or less than the double ||
||BIGINT ||ceil(double a) ||returns the minimum BIGINT value that is equal or greater than the double ||
||double ||rand(), rand(int seed) ||returns a random number (that changes from row to row). Specifying the seed will make sure the generated random number sequence is deterministic. ||
||string ||concat(string A, string B) ||returns the string resulting from concatenating B after A, e.g. concat('foo', 'bar') results in 'foobar' ||
||string ||substr(string A, int start) ||returns the substring of A starting from start position till the end of string A, e.g. substr('foobar', 4) results in 'bar' ||
||string ||substr(string A, int start, int length) ||returns the substring of A starting from start position with the given length, e.g. substr('foobar', 4, 2) results in 'ba' ||
||string ||upper(string A) ||returns the string resulting from converting all characters of A to upper case, e.g. upper('fOoBaR') results in 'FOOBAR' ||
||string ||ucase(string A) ||Same as upper ||
||string ||lower(string A) ||returns the string resulting from converting all characters of A to lower case, e.g. lower('fOoBaR') results in 'foobar' ||
||string ||lcase(string A) ||Same as lower ||
||string ||trim(string A) ||returns the string resulting from trimming spaces from both ends of A, e.g. trim(' foobar ') results in 'foobar' ||
||string ||ltrim(string A) ||returns the string resulting from trimming spaces from the beginning (left hand side) of A, e.g. ltrim(' foobar ') results in 'foobar ' ||
||string ||rtrim(string A) ||returns the string resulting from trimming spaces from the end (right hand side) of A, e.g. rtrim(' foobar ') results in ' foobar' ||
||string ||regexp_replace(string A, string B, string C) ||returns the string resulting from replacing all substrings in A that match the Java regular expression B (see [[http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html|Java regular expressions syntax]]) with C, e.g. regexp_replace('foobar', 'oo|ar', '') returns 'fb' ||
||int ||size(Map) ||returns the number of elements in the map type ||
||int ||size(Array) ||returns the number of elements in the array type ||
||{{{<type>}}} ||cast(expr as {{{<type>}}}) ||converts the results of the expression expr to the given type, e.g. cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed. ||
||string ||from_unixtime(int unixtime) ||converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00" ||
||string ||to_date(string timestamp) ||returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" ||
||int ||year(string date) ||returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970 ||
||int ||month(string date) ||returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11 ||
||int ||day(string date) ||returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1 ||
||string ||get_json_object(string json_string, string path) ||extracts the json object from a json string based on the json path specified, and returns the json string of the extracted json object. It returns null if the input json string is invalid ||
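
A small sketch of a few of these functions in use (assuming the page_view table, where viewTime holds a Unix time stamp as described in the Data Units section):

{{{
    SELECT upper(substr(pv.page_url, 1, 20)),     -- string functions compose
           to_date(from_unixtime(pv.viewTime))    -- seconds since epoch -> 'yyyy-MM-dd'
    FROM page_view pv;
}}}
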
The following built in aggregate functions are supported in Hive:

'''Built in Aggregate Functions'''
||'''Return Type''' ||'''Name(Signature)''' ||'''Description''' ||
||BIGINT ||count(1), count(DISTINCT col [, col]...) ||count(1) returns the number of members in the group, whereas count(DISTINCT col) gets the count of distinct values of the columns in the group ||
||DOUBLE ||sum(col), sum(DISTINCT col) ||returns the sum of the elements in the group or the sum of the distinct values of the column in the group ||
||DOUBLE ||avg(col), avg(DISTINCT col) ||returns the average of the elements in the group or the average of the distinct values of the column in the group ||
||DOUBLE ||min(col) ||returns the minimum value of the column in the group ||
||DOUBLE ||max(col) ||returns the maximum value of the column in the group ||
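
For instance (a sketch assuming a user table with the gender and age fields of the User type described above):

{{{
    SELECT u.gender, count(1), avg(u.age), min(u.age), max(u.age)
    FROM user u
    GROUP BY u.gender;
}}}
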

== Language capabilities ==
The Hive query language provides basic SQL-like operations. These operations work on tables or partitions. These operations are:

 * Ability to filter rows from a table using a where clause.
 * Ability to select certain columns from the table using a select clause.
 * Ability to do equi-joins between two tables.
 * Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
 * Ability to store the results of a query into another table.
 * Ability to download the contents of a table to a local (e.g., nfs) directory.
 * Ability to store the results of a query in a hadoop dfs directory.
 * Ability to manage tables and partitions (create, drop and alter).
 * Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

= Usage and Examples =
The following examples highlight some salient features of the system. A detailed set of query test cases can be found at [[http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/test/queries/clientpositive/|Hive Query Test Cases]] and the corresponding results can be found at [[http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/test/results/clientpositive/|Query Test Case Results]].

== Creating Tables ==
An example statement that would create the page_view table mentioned above would look like:

{{{
    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;
}}}

In this example the columns of the table are specified with the corresponding types. Comments can be attached both at the column level as well as at the table level. Additionally the partitioned by clause defines the partitioning columns, which are different from the data columns and are actually not stored with the data. When specified in this way, the data in the files is assumed to be delimited with ascii 001 (ctrl-A) as the field delimiter and newline as the row delimiter.

These delimiters can be parametrized if the data is not in the above format, as illustrated in the following example:

{{{
    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
    LINES TERMINATED BY '12'
    STORED AS SEQUENCEFILE;
}}}

The ROW FORMAT clause allows the user to specify both the field delimiters as well as the line delimiters.

It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set (note: if bucketing is absent, random sampling can still be done on the table). The following example illustrates the case of the page_view table which is bucketed on the userid column:

{{{
    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
            COLLECTION ITEMS TERMINATED BY '2'
            MAP KEYS TERMINATED BY '3'
    LINES TERMINATED BY '12'
    STORED AS SEQUENCEFILE;
}}}

In the example above, the table is clustered by (bucketed on) userid and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency.

{{{
    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
            COLLECTION ITEMS TERMINATED BY '2'
            MAP KEYS TERMINATED BY '3'
    LINES TERMINATED BY '12'
    STORED AS SEQUENCEFILE;
}}}

In this example the columns that comprise the table row are specified in a similar way as the definition of types. Comments can be attached both at the column level as well as at the table level. Additionally the partitioned by clause defines the partitioning columns, which are different from the data columns and are actually not stored with the data. The clustered by clause specifies which column to use for bucketing as well as how many buckets to create. The delimited row format specifies how the rows are stored in the hive table. In the case of the delimited format, this specifies how the fields are terminated, how the items within collections (arrays or maps) are terminated, and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this data is stored in a binary format (using hadoop SequenceFiles) on hdfs. The values shown for the ROW FORMAT and STORED AS clauses in the above example represent the system defaults.

Table names and column names are case insensitive.

== Browsing Tables and Partitions ==
{{{
    SHOW TABLES;
}}}
To list existing tables in the warehouse; there are many of these, likely more than you want to browse.

{{{
    SHOW TABLES 'page.*';
}}}
To list tables with prefix 'page'. The pattern follows Java regular expression syntax (so the period is a wildcard).

{{{
    SHOW PARTITIONS page_view;
}}}
To list partitions of a table. If the table is not a partitioned table then an error is thrown.

{{{
    DESCRIBE page_view;
}}}
To list columns and column types of a table.

{{{
    DESCRIBE EXTENDED page_view;
}}}
To list columns and all other properties of a table. This prints a lot of information, and not in a pretty format. Usually used for debugging.

{{{
    DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');
}}}
To list columns and all other properties of a partition. This also prints a lot of information, which is usually used for debugging.

== Loading Data ==
There are multiple mechanisms for loading data into Hive tables. The user can create an external table that points to a specified location within hdfs. In this particular usage, the user can copy a file into the specified location using the hdfs put or copy commands and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform this data and insert it into any other Hive table.
For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can achieve this:

{{{
    CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User',
                    country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '54' LINES TERMINATED BY '12'
    STORED AS TEXTFILE
    LOCATION '/user/data/staging/page_view';

    hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
    SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
    WHERE pvs.country = 'US';
}}}

In the example above nulls are inserted for the array and map types in the destination table, but potentially these can also come from the external table if the proper row formats are specified.

This method is useful if there is already legacy data in hdfs on which the user wants to put some metadata, so that the data can be queried and manipulated using Hive.

Additionally, the system also supports syntax that can load the data from a file in the local file system directly into a Hive table where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then we do not need any additional filtering as shown in the previous example. The load in this case can be done using the following syntax:

{{{
    LOAD DATA LOCAL INPATH `/tmp/pv_2008-06-08_us.txt` INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
}}}

The path argument can take a directory (in which case all the files in the directory are loaded), a single file name, or a wildcard (in which case all the matching files are uploaded). If the argument is a directory, it cannot contain subdirectories. Similarly, the wildcard must match file names only.

In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of the data (using tools that are external to Hive).
Once the file is in HDFS, the following syntax can be used to load the data into a Hive table:

{{{
    LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
}}}

It is assumed that the array and map fields in the input files are null fields for these examples.

== Simple Query ==
For all the active users, one can use the query of the following form:

{{{
    INSERT OVERWRITE TABLE user_active
    SELECT user.*
    FROM user
    WHERE user.active = 1;
}}}

Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file.

== Partition Based Query ==
What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. E.g. in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query:

{{{
    INSERT OVERWRITE TABLE xyz_com_page_views
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
          page_views.referrer_url like '%xyz.com';
}}}

(Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING); if you name your partition column something different, don't expect .date to do what you think!)

== Joins ==
In order to get a demographic breakdown (by gender) of page_view data for 2008-03-03, one would need to join the page_view table and the user table on the userid column.
This can be accomplished with a join as shown in the following query:

{{{
    INSERT OVERWRITE TABLE pv_users
    SELECT pv.*, u.gender, u.age
    FROM user u JOIN page_view pv ON (pv.userid = u.id)
    WHERE pv.date = '2008-03-03';
}}}

In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). E.g. in order to do a full outer join in the query above, the corresponding syntax would look like the following query:

{{{
    INSERT OVERWRITE TABLE pv_users
    SELECT pv.*, u.gender, u.age
    FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
    WHERE pv.date = '2008-03-03';
}}}

In order to join more than one table, the user can use the following syntax:

{{{
    INSERT OVERWRITE TABLE pv_friends
    SELECT pv.*, u.gender, u.age, f.friends
    FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
    WHERE pv.date = '2008-03-03';
}}}

Note that Hive only supports [[http://en.wikipedia.org/wiki/Join_(SQL)#Equi-join|equi-joins]]. Note also that it is best to put the largest table on the rightmost side of the join in order to avoid memory errors.

== Aggregations ==
In order to count the number of distinct users by gender, one could write the following query:

{{{
    INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;
}}}

Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. E.g. while the following is possible:

{{{
    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), sum(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;
}}}

the following query is not allowed:

{{{
    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
    FROM pv_users
    GROUP BY pv_users.gender;
}}}

== Multi Table/File Inserts ==
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities).
== Multi Table/File Inserts ==
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities). E.g., if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

{{{
    FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.gender

    INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
        SELECT pv_users.age, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.age;
}}}

The first insert clause sends the results of the first group by to a Hive table, while the second one sends the results to a hadoop dfs file.

== Inserting into local files ==
In certain situations you would want to write the output into a local file so that you could load it into an Excel spreadsheet. This can be accomplished with the following command:

{{{
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
    SELECT pv_gender_sum.*
    FROM pv_gender_sum;
}}}

== Sampling ==
The sampling clause allows the users to write queries for samples of the data instead of the whole table. Currently the sampling is done on the columns that are specified in the CLUSTERED BY clause of the CREATE TABLE statement. In the following example we choose the 3rd bucket out of the 32 buckets of the pv_gender_sum table:

{{{
    INSERT OVERWRITE TABLE pv_gender_sum_sample
    SELECT pv_gender_sum.*
    FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);
}}}

In general the TABLESAMPLE syntax looks like:

{{{
    TABLESAMPLE(BUCKET x OUT OF y)
}}}

y has to be a multiple or divisor of the number of buckets in that table as specified at table creation time. The buckets chosen are those for which bucket_number modulo y is equal to x. So in the above example, the following TABLESAMPLE clause

{{{
    TABLESAMPLE(BUCKET 3 OUT OF 16)
}}}

would pick out the 3rd and 19th buckets (the buckets are numbered starting from 0).
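To make that concrete, a small sketch (assuming the pv_gender_sum table from the example above; this query is not in the original text) that scans only those two buckets:

{{{
    SELECT pv_gender_sum.*
    FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 16);
}}}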
On the other hand, the TABLESAMPLE clause

{{{
    TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)
}}}

would pick out half of the 3rd bucket.

== Union all ==
The language also supports union all. E.g., suppose there are two different tables that track which user has published a video and which user has published a comment; the following query joins the results of a union all with the user table to create a single annotated stream for all the video publishing and comment publishing events:

{{{
    INSERT OVERWRITE TABLE actions_users
    SELECT u.id, actions.date
    FROM (
        SELECT av.uid AS uid
        FROM action_video av
        WHERE av.date = '2008-06-03'

        UNION ALL

        SELECT ac.uid AS uid
        FROM action_comment ac
        WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON (u.id = actions.uid);
}}}

== Array Operations ==
Array columns in tables can currently only be created programmatically. We will be extending this soon to be available as part of the create table statement. For the purpose of the current example, assume that pv.friends is of type array<INT>, i.e. an array of integers. The user can get a specific element in the array by its index, as shown in the following command:

{{{
    SELECT pv.friends[2]
    FROM page_views pv;
}}}

The select expression gets the third item in the pv.friends array, since array elements are indexed starting from 0.

The user can also get the length of the array using the size function, as shown below:

{{{
    SELECT pv.userid, size(pv.friends)
    FROM page_view pv;
}}}
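Array expressions can also appear in predicates. A small hedged sketch (not in the original text; the friend id 12345 is made up) that combines the two functions above to find users whose first listed friend is a given id:

{{{
    SELECT pv.userid
    FROM page_view pv
    WHERE size(pv.friends) > 0 AND pv.friends[0] = 12345;
}}}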
== Map(Associative Arrays) Operations ==
Maps provide collections similar to associative arrays. Such structures can currently only be created programmatically. We will be extending this soon. For the purpose of the current example, assume that pv.properties is of type map<STRING, STRING>, i.e. an associative array from strings to strings. Accordingly, the following query:

{{{
    INSERT OVERWRITE TABLE page_views_map
    SELECT pv.userid, pv.properties['page_type']
    FROM page_views pv;
}}}

can be used to select the 'page_type' property from the page_views table.

Similar to arrays, the size function can also be used to get the number of elements in a map, as shown in the following query:

{{{
    SELECT size(pv.properties)
    FROM page_view pv;
}}}

== Custom map/reduce scripts ==
Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive language. E.g., in order to run a custom mapper script (map_script) and a custom reducer script (reduce_script), the user can issue the following command, which uses the TRANSFORM clause to embed the mapper and the reducer scripts.

Note that columns will be transformed to string and delimited by TAB before feeding to the user script, and the standard output of the user script will be treated as TAB-separated string columns. User scripts can output debug information to standard error, which will be shown on the task detail page on hadoop.

{{{
    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script'
        AS dt, uid
        CLUSTER BY dt) map_output

    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.dt, map_output.uid
        USING 'reduce_script'
        AS date, count;
}}}

Sample map script (weekday_mapper.py):

{{{
import sys
import datetime

# Columns arrive TAB-separated on standard input, one row per line.
for line in sys.stdin:
    line = line.strip()
    userid, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    # Emit the output columns TAB-separated, since Hive treats the
    # script's standard output as TAB-separated string columns.
    print '\t'.join([userid, str(weekday)])
}}}

Of course, both MAP and REDUCE are "syntactic sugar" for the more general select transform. The inner query could also have been written as such:

{{{
    SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS dt, uid CLUSTER BY dt FROM pv_users;
}}}
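The page does not show reduce_script itself, so the following Python sketch is a hypothetical companion to weekday_mapper.py, not part of the original tutorial. It assumes the CLUSTER BY dt arrangement above, i.e. that all rows sharing a dt value arrive contiguously on standard input, and emits one (date, count) row per group:

{{{
import sys

# Hypothetical reduce_script: rows arrive as TAB-separated (dt, uid)
# pairs, clustered by dt, so each dt group is contiguous.
last_dt = None
count = 0

for line in sys.stdin:
    dt, uid = line.strip().split('\t')
    if last_dt is not None and dt != last_dt:
        # dt changed: emit the count for the previous group.
        print '\t'.join([last_dt, str(count)])
        count = 0
    last_dt = dt
    count += 1

# Emit the final group, if any input was seen.
if last_dt is not None:
    print '\t'.join([last_dt, str(count)])
}}}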
Schema-less map/reduce: if there is no "AS" clause after "USING map_script", Hive assumes the output of the script contains two parts: a key, which is everything before the first tab, and a value, which is the rest after the first tab. Note that this is different from specifying "AS key, value", because in that case the value will only contain the portion between the first tab and the second tab if there are multiple tabs.

In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map output. The user still needs to know the reduce output schema, because that has to match what is in the table that we are inserting to.

{{{
    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script'
        CLUSTER BY key) map_output

    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.key, map_output.value
        USING 'reduce_script'
        AS date, count;
}}}

Distribute By and Sort By: instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of the sort columns, but that is not required.

{{{
    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script'
        AS c1, c2, c3
        DISTRIBUTE BY c2
        SORT BY c2, c1) map_output

    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.c1, map_output.c2, map_output.c3
        USING 'reduce_script'
        AS date, count;
}}}
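For comparison (a sketch, not in the original page): CLUSTER BY x is shorthand for DISTRIBUTE BY x SORT BY x on the same column, so the two forms below route and order the map output identically:

{{{
    SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS c1, c2 CLUSTER BY c1 FROM pv_users;
    SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS c1, c2 DISTRIBUTE BY c1 SORT BY c1 FROM pv_users;
}}}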
== Co groups ==
Amongst users of map/reduce, cogroup is a fairly common operation wherein the data from multiple tables is sent to a custom reducer such that the rows are grouped by the values of certain columns of the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be achieved in the Hive query language in the following way. Suppose we wanted to cogroup the rows from the action_video and action_comment tables on the uid column and send them to the 'reduce_script' custom reducer; the following syntax can be used:

{{{
    FROM (
        FROM (
            FROM action_video av
            SELECT av.uid AS uid, av.id AS id, av.date AS date

            UNION ALL

            FROM action_comment ac
            SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
        ) union_actions
        SELECT union_actions.uid, union_actions.id, union_actions.date
        CLUSTER BY union_actions.uid) map

    INSERT OVERWRITE TABLE actions_reduced
        SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);
}}}

== Altering Tables ==
To rename an existing table to a new name (if a table with the new name already exists, an error is returned):

{{{
    ALTER TABLE old_table_name RENAME TO new_table_name;
}}}

To rename the columns of an existing table (be sure to use the same column types, and to include an entry for each preexisting column):

{{{
    ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);
}}}

To add columns to an existing table:

{{{
    ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');
}}}

Note that a change in the schema (such as adding columns) preserves the schema for the old partitions of the table, in case it is a partitioned table. All queries that access these columns and run over the old partitions implicitly return a null value or the specified default values for these columns.

In later versions, we can make the behavior of assuming certain values, as opposed to throwing an error when the column is not found in a particular partition, configurable.

== Dropping Tables and Partitions ==
Dropping tables is fairly trivial. A drop on the table would implicitly drop any indexes (this is a future feature) that would have been built on the table. The associated command is:

{{{
    DROP TABLE pv_users;
}}}

To drop a partition, alter the table:

{{{
    ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')
}}}

'''Note that any data for this table or partitions will be dropped and may not be recoverable.'''
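Conversely, a partition can also be created explicitly before any data is loaded into it. A hedged sketch (ADD PARTITION is assumed here and is not covered by the original page):

{{{
    ALTER TABLE pv_users ADD PARTITION (ds='2008-08-09');
}}}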