#pragma section-numbers 1

= Concepts =

== What is Hive ==
Hive is a next-generation data processing infrastructure designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called QL, which is based on SQL and enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

== What is NOT Hive ==
Hive is based on Hadoop, which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly one of submitting jobs and being notified when the jobs are completed, as opposed to real-time queries. As a result it should not be compared with systems such as Oracle, where analyses are conducted on a significantly smaller amount of data but proceed much more iteratively, with the response times between iterations being less than a few minutes. A typical Hive query's response time is usually greater than a couple of minutes, and large jobs may even run into hours. What Hive provides is a fault-tolerant, scale-out option: more commodity boxes can be added to the Hadoop cluster as the data size and/or workload increases, and Hive will automatically benefit from that.

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of the QL language with the help of some examples.

== Data Units ==
In order of granularity, Hive data is organized into:

 * Databases: Namespaces that separate tables and other data units from naming conflicts.
 * Tables: Homogeneous units of data which have the same schema. An example of a table could be the page_views table, where each row could comprise of the following columns (schema):
  * timestamp - which is of INT type that corresponds to a Unix time stamp of when the page was viewed.
  * userid - which is of BIGINT type that identifies the user who viewed the page.
  * page_url - which is of STRING type that captures the location of the page.
  * referer_url - which is of STRING type that captures the location of the page from where the user arrived at the current page.
  * IP - which is of STRING type that captures the IP address from where the page request was made.
 * Partitions: Each Table can have one or more partition keys which determine how the data is stored. Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy certain criteria. For example, a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly (see the example query after this list). Note however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience but it is the user's job to guarantee the relationship between partition name and data content! Partition columns are virtual columns; they are not part of the data itself but are derived on load.
 * Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. These can be used to efficiently sample the data.

Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.
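
For instance, assuming page_views is partitioned on columns named dt and country (the names used in the CREATE TABLE examples later on this page), a partition-pruned query could look like the following sketch; the where clause on the partition columns lets Hive scan only the single matching partition:

{{{
    SELECT page_views.*
    FROM page_views
    WHERE page_views.dt = '2009-12-23' AND page_views.country = 'US';
}}}
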
== Type System ==

=== Primitive Types ===
Types are associated with the columns in the tables. The following primitive types are supported:

 * Integers
  * TINYINT - 1 byte integer
  * SMALLINT - 2 byte integer
  * INT - 4 byte integer
  * BIGINT - 8 byte integer
 * Boolean type
  * BOOLEAN - TRUE/FALSE
 * Floating point numbers
  * FLOAT - single precision
  * DOUBLE - double precision
 * String type
  * STRING - sequence of characters in a specified character set

=== Type Conversion ===
The types are organized in the following hierarchy (where the parent is a super type of all the children instances):

 Type
  |→ Primitive Type
   |→ Number
    |→ DOUBLE
     |→ BIGINT
      |→ INT
       |→ TINYINT
     |→ FLOAT
   |→ STRING
   |→ BOOLEAN
  |→ Complex Type

This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Apart from these fundamental rules for implicit conversion based on the type system, Hive also allows one special case of conversion:

 * STRING → DOUBLE

Explicit type conversion can be done using the cast operator as shown in the [[#Built_in_functions|Built in Functions]] section below.
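
As a small sketch of both conversion paths (assuming a hypothetical table t with a STRING column s that holds numeric text):

{{{
    -- s is implicitly promoted to DOUBLE for the comparison (the STRING → DOUBLE special case);
    -- cast() performs the conversion to BIGINT explicitly.
    SELECT cast(t.s AS BIGINT)
    FROM t
    WHERE t.s > 0.5;
}}}
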
=== Complex Types ===
Complex Types can be built up from primitive types and other composite types using:

 * Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
 * Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example, in a map M comprising of a mapping from 'group' -> gid, the gid value can be accessed using M['group'].
 * Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise of the following fields:

 * id - which is a 4-byte INT.
 * name - which is a STRING.
 * age - which is a 1-byte TINYINT.
 * weight - which is a FLOAT number.
 * friends - which is an array of ids (BIGINT).
 * gender - which is a STRING.
 * active - which is a BOOLEAN.
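
These access notations compose naturally. A brief sketch (assuming a hypothetical table users whose rows have the User fields above, plus a properties MAP column added here purely for illustration):

{{{
    SELECT u.name,                 -- plain column
           u.friends[0],           -- first element of the friends array
           u.properties['group']   -- map lookup by key
    FROM users u;
}}}
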
== Built in operators and functions ==

=== Built in operators ===
 * Relational Operators - The following operators compare the passed operands and generate a TRUE or FALSE value depending on whether the comparison between the operands holds or not (see the example after the table).

'''Relational Operators'''
||'''Operator''' ||'''Operand types''' ||'''Description''' ||
||A = B ||all primitive types ||TRUE if expression A is equal to expression B, otherwise FALSE. Note: SQL uses =, not == ||
||A <> B ||all primitive types ||TRUE if expression A is NOT equal to expression B, otherwise FALSE ||
||A < B ||all primitive types ||TRUE if expression A is less than expression B, otherwise FALSE ||
||A <= B ||all primitive types ||TRUE if expression A is less than or equal to expression B, otherwise FALSE ||
||A > B ||all primitive types ||TRUE if expression A is greater than expression B, otherwise FALSE ||
||A >= B ||all primitive types ||TRUE if expression A is greater than or equal to expression B, otherwise FALSE ||
||A IS NULL ||all types ||TRUE if expression A evaluates to NULL, otherwise FALSE ||
||A IS NOT NULL ||all types ||FALSE if expression A evaluates to NULL, otherwise TRUE ||
||A LIKE B ||strings ||TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in posix regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in posix regular expressions). For example, 'foobar' like 'foo' evaluates to FALSE whereas 'foobar' like {{{'foo___'}}} evaluates to TRUE and so does 'foobar' like 'foo%' ||
||A RLIKE B ||strings ||TRUE if string A matches the Java regular expression B (see [[http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html|Java regular expressions syntax]]), otherwise FALSE. For example, 'foobar' rlike 'foo' evaluates to FALSE whereas 'foobar' rlike '^f.*r$' evaluates to TRUE ||
||A REGEXP B ||strings ||Same as RLIKE ||
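
A short sketch of LIKE and RLIKE in a query (assuming the page_view table described earlier):

{{{
    SELECT pv.page_url
    FROM page_view pv
    WHERE pv.referrer_url LIKE '%xyz.com'        -- SQL simple pattern: % matches any prefix
       OR pv.referrer_url RLIKE '^https?://';    -- Java regular expression
}}}
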
 * Arithmetic Operators - The following operators support various common arithmetic operations on the operands. All of them return number types.

'''Arithmetic Operators'''
||'''Operator''' ||'''Operand types''' ||'''Description''' ||
||A + B ||all number types ||Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands, e.g. since float is a containing type of integer, the + operator on a float and an int will result in a float. ||
||A - B ||all number types ||Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. ||
||A * B ||all number types ||Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes an overflow, you will have to cast one of the operands to a type higher in the type hierarchy (see the sketch after this table). ||
||A / B ||all number types ||Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division. ||
||A % B ||all number types ||Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. ||
||A & B ||all number types ||Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. ||
||A | B ||all number types ||Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. ||
||A ^ B ||all number types ||Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. ||
||~A ||all number types ||Gives the result of bitwise NOT of A. The type of the result is the same as the type of A. ||
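
A minimal sketch of working around integer overflow with an explicit cast (assuming a hypothetical table t with INT columns a and b whose product may not fit in an INT):

{{{
    -- Casting one operand to BIGINT promotes the whole multiplication
    -- to the wider type, per the type hierarchy above.
    SELECT cast(t.a AS BIGINT) * t.b
    FROM t;
}}}
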
 * Logical Operators - The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.

'''Logical Operators'''
||'''Operator''' ||'''Operand types''' ||'''Description''' ||
||A AND B ||boolean ||TRUE if both A and B are TRUE, otherwise FALSE ||
||A && B ||boolean ||Same as A AND B ||
||A OR B ||boolean ||TRUE if either A or B or both are TRUE, otherwise FALSE ||
||{{{ A || B }}} ||boolean ||Same as A OR B ||
||NOT A ||boolean ||TRUE if A is FALSE, otherwise FALSE ||
||!A ||boolean ||Same as NOT A ||

 * Operators on Complex Types - The following operators provide mechanisms to access elements in Complex Types.

'''Operators on Complex Types'''
||'''Operator''' ||'''Operand types''' ||'''Description''' ||
||A[n] ||A is an Array and n is an int ||returns the nth element in the array A. The first element has index 0, e.g. if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar' ||
||M[key] ||M is a Map and key has type K ||returns the value corresponding to the key in the map, e.g. if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar' ||
||S.x ||S is a struct ||returns the x field of S, e.g. for struct foobar {int foo, int bar} foobar.foo returns the integer stored in the foo field of the struct. ||

=== Built in functions ===
The following built in functions are supported in Hive ([[http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java?view=markup|list of functions in source code: FunctionRegistry.java]]):

'''Built in Functions'''
||'''Return Type''' ||'''Name(Signature)''' ||'''Description''' ||
||BIGINT ||round(double a) ||returns the rounded BIGINT value of the double ||
||BIGINT ||floor(double a) ||returns the maximum BIGINT value that is equal or less than the double ||
||BIGINT ||ceil(double a) ||returns the minimum BIGINT value that is equal or greater than the double ||
||double ||rand(), rand(int seed) ||returns a random number (that changes from row to row). Specifying the seed will make sure the generated random number sequence is deterministic. ||
||string ||concat(string A, string B) ||returns the string resulting from concatenating B after A, e.g. concat('foo', 'bar') results in 'foobar' ||
||string ||substr(string A, int start) ||returns the substring of A starting from start position till the end of string A, e.g. substr('foobar', 4) results in 'bar' ||
||string ||substr(string A, int start, int length) ||returns the substring of A starting from start position with the given length, e.g. substr('foobar', 4, 2) results in 'ba' ||
||string ||upper(string A) ||returns the string resulting from converting all characters of A to upper case, e.g. upper('fOoBaR') results in 'FOOBAR' ||
||string ||ucase(string A) ||Same as upper ||
||string ||lower(string A) ||returns the string resulting from converting all characters of A to lower case, e.g. lower('fOoBaR') results in 'foobar' ||
||string ||lcase(string A) ||Same as lower ||
||string ||trim(string A) ||returns the string resulting from trimming spaces from both ends of A, e.g. trim(' foobar ') results in 'foobar' ||
||string ||ltrim(string A) ||returns the string resulting from trimming spaces from the beginning (left hand side) of A, e.g. ltrim(' foobar ') results in 'foobar ' ||
||string ||rtrim(string A) ||returns the string resulting from trimming spaces from the end (right hand side) of A, e.g. rtrim(' foobar ') results in ' foobar' ||
||string ||regexp_replace(string A, string B, string C) ||returns the string resulting from replacing all substrings in A that match the Java regular expression B (see [[http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html|Java regular expressions syntax]]) with C, e.g. regexp_replace('foobar', 'oo|ar', '') returns 'fb' ||
||int ||size(Map) ||returns the number of elements in the map type ||
||int ||size(Array) ||returns the number of elements in the array type ||
||{{{<type>}}} ||cast(expr as {{{<type>}}}) ||converts the results of the expression expr to the given type, e.g. cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed. ||
||string ||from_unixtime(int unixtime) ||converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00" ||
||string ||to_date(string timestamp) ||returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" ||
||int ||year(string date) ||returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970 ||
||int ||month(string date) ||returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11 ||
||int ||day(string date) ||returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1 ||
||string ||get_json_object(string json_string, string path) ||extracts the json object from a json string based on the json path specified, and returns the json string of the extracted json object. It returns null if the input json string is invalid ||
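
A small sketch of a few of these functions in use (assuming the page_view table, where viewTime holds a Unix time stamp as described in the Data Units section):

{{{
    SELECT upper(substr(pv.page_url, 1, 20)),     -- string functions compose
           to_date(from_unixtime(pv.viewTime))    -- seconds since epoch -> 'yyyy-MM-dd'
    FROM page_view pv;
}}}
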
The following built in aggregate functions are supported in Hive:

'''Built in Aggregate Functions'''
||'''Return Type''' ||'''Name(Signature)''' ||'''Description''' ||
||BIGINT ||count(1), count(DISTINCT col [, col]...) ||count(1) returns the number of members in the group, whereas count(DISTINCT col) gets the count of distinct values of the columns in the group ||
||DOUBLE ||sum(col), sum(DISTINCT col) ||returns the sum of the elements in the group or the sum of the distinct values of the column in the group ||
||DOUBLE ||avg(col), avg(DISTINCT col) ||returns the average of the elements in the group or the average of the distinct values of the column in the group ||
||DOUBLE ||min(col) ||returns the minimum value of the column in the group ||
||DOUBLE ||max(col) ||returns the maximum value of the column in the group ||
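
For instance (a sketch assuming a user table with the gender and age fields of the User type described above):

{{{
    SELECT u.gender, count(1), avg(u.age), min(u.age), max(u.age)
    FROM user u
    GROUP BY u.gender;
}}}
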

== Language capabilities ==
The Hive query language provides basic SQL-like operations. These operations work on tables or partitions. These operations are:

 * Ability to filter rows from a table using a where clause.
 * Ability to select certain columns from the table using a select clause.
 * Ability to do equi-joins between two tables.
 * Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
 * Ability to store the results of a query into another table.
 * Ability to download the contents of a table to a local (e.g., nfs) directory.
 * Ability to store the results of a query in a hadoop dfs directory.
 * Ability to manage tables and partitions (create, drop and alter).
 * Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

= Usage and Examples =
The following examples highlight some salient features of the system. A detailed set of query test cases can be found at [[http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/test/queries/clientpositive/|Hive Query Test Cases]] and the corresponding results can be found at [[http://svn.apache.org/viewvc/hadoop/hive/trunk/ql/src/test/results/clientpositive/|Query Test Case Results]].

== Creating Tables ==
An example statement that would create the page_view table mentioned above would look like:

{{{
    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;
}}}

In this example the columns of the table are specified with the corresponding types. Comments can be attached both at the column level as well as at the table level. Additionally the partitioned by clause defines the partitioning columns, which are different from the data columns and are actually not stored with the data. When specified in this way, the data in the files is assumed to be delimited with ascii 001 (ctrl-A) as the field delimiter and newline as the row delimiter.

These delimiters can be parametrized if the data is not in the above format, as illustrated in the following example:

{{{
    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
    LINES TERMINATED BY '12'
    STORED AS SEQUENCEFILE;
}}}

The ROW FORMAT clause allows the user to specify both the field delimiters as well as the line delimiters.

It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set (note: if bucketing is absent, random sampling can still be done on the table). The following example illustrates the case of the page_view table which is bucketed on the userid column:

{{{
    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
            COLLECTION ITEMS TERMINATED BY '2'
            MAP KEYS TERMINATED BY '3'
    LINES TERMINATED BY '12'
    STORED AS SEQUENCEFILE;
}}}

In the example above, the table is clustered by (bucketed on) userid and within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries, also increasing efficiency.

{{{
    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
            COLLECTION ITEMS TERMINATED BY '2'
            MAP KEYS TERMINATED BY '3'
    LINES TERMINATED BY '12'
    STORED AS SEQUENCEFILE;
}}}

In this example the columns that comprise the table row are specified in a similar way as the definition of types. Comments can be attached both at the column level as well as at the table level. Additionally the partitioned by clause defines the partitioning columns, which are different from the data columns and are actually not stored with the data. The clustered by clause specifies which column to use for bucketing as well as how many buckets to create. The delimited row format specifies how the rows are stored in the hive table. In the case of the delimited format, this specifies how the fields are terminated, how the items within collections (arrays or maps) are terminated, and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this data is stored in a binary format (using hadoop SequenceFiles) on hdfs. The values shown for the ROW FORMAT and STORED AS clauses in the above example represent the system defaults.

Table names and column names are case insensitive.

== Browsing Tables and Partitions ==
{{{
    SHOW TABLES;
}}}
To list existing tables in the warehouse; there are many of these, likely more than you want to browse.

{{{
    SHOW TABLES 'page.*';
}}}
To list tables with prefix 'page'. The pattern follows Java regular expression syntax (so the period is a wildcard).

{{{
    SHOW PARTITIONS page_view;
}}}
To list partitions of a table. If the table is not a partitioned table then an error is thrown.

{{{
    DESCRIBE page_view;
}}}
To list columns and column types of a table.

{{{
    DESCRIBE EXTENDED page_view;
}}}
To list columns and all other properties of a table. This prints a lot of information, and not in a pretty format. Usually used for debugging.

{{{
    DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');
}}}
To list columns and all other properties of a partition. This also prints a lot of information, which is usually used for debugging.

== Loading Data ==
There are multiple mechanisms for loading data into Hive tables. The user can create an external table that points to a specified location within hdfs. In this particular usage, the user can copy a file into the specified location using the hdfs put or copy commands and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform this data and insert it into any other Hive table.
For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can achieve this:

{{{
    CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User',
                    country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '54' LINES TERMINATED BY '12'
    STORED AS TEXTFILE
    LOCATION '/user/data/staging/page_view';

    hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
    SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
    WHERE pvs.country = 'US';
}}}

In the example above nulls are inserted for the array and map types in the destination table, but potentially these can also come from the external table if the proper row formats are specified.

This method is useful if there is already legacy data in hdfs on which the user wants to put some metadata, so that the data can be queried and manipulated using Hive.

Additionally, the system also supports syntax that can load the data from a file in the local file system directly into a Hive table where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then we do not need any additional filtering as shown in the previous example. The load in this case can be done using the following syntax:

{{{
    LOAD DATA LOCAL INPATH `/tmp/pv_2008-06-08_us.txt` INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
}}}

The path argument can take a directory (in which case all the files in the directory are loaded), a single file name, or a wildcard (in which case all the matching files are uploaded). If the argument is a directory, it cannot contain subdirectories. Similarly, the wildcard must match file names only.

In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of the data (using tools that are external to Hive).
Once the file is in HDFS, the following syntax can be used to load the data into a Hive table:

{{{
    LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
}}}

It is assumed that the array and map fields in the input files are null fields for these examples.

== Simple Query ==
For all the active users, one can use the query of the following form:

{{{
    INSERT OVERWRITE TABLE user_active
    SELECT user.*
    FROM user
    WHERE user.active = 1;
}}}

Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file.

== Partition Based Query ==
What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. E.g. in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query:

{{{
    INSERT OVERWRITE TABLE xyz_com_page_views
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
          page_views.referrer_url like '%xyz.com';
}}}

(Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING); if you name your partition column something different, don't expect .date to do what you think!)

== Joins ==
In order to get a demographic breakdown (by gender) of page_view data for 2008-03-03, one would need to join the page_view table and the user table on the userid column.
This can be accomplished with a join as shown in the following query:

{{{
    INSERT OVERWRITE TABLE pv_users
    SELECT pv.*, u.gender, u.age
    FROM user u JOIN page_view pv ON (pv.userid = u.id)
    WHERE pv.date = '2008-03-03';
}}}

In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). E.g. in order to do a full outer join in the query above, the corresponding syntax would look like the following query:

{{{
    INSERT OVERWRITE TABLE pv_users
    SELECT pv.*, u.gender, u.age
    FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
    WHERE pv.date = '2008-03-03';
}}}

In order to join more than one table, the user can use the following syntax:

{{{
    INSERT OVERWRITE TABLE pv_friends
    SELECT pv.*, u.gender, u.age, f.friends
    FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
    WHERE pv.date = '2008-03-03';
}}}

Note that Hive only supports [[http://en.wikipedia.org/wiki/Join_(SQL)#Equi-join|equi-joins]]. Note also that it is best to put the largest table on the rightmost side of the join in order to avoid memory errors.

== Aggregations ==
In order to count the number of distinct users by gender, one could write the following query:

{{{
    INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;
}}}

Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. E.g. while the following is possible:

{{{
    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), sum(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;
}}}

the following query is not allowed:

{{{
    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
    FROM pv_users
    GROUP BY pv_users.gender;
}}}

== Multi Table/File Inserts ==
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities).
== Multi Table/File Inserts ==
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities). E.g., if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

{{{
    FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.gender

    INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
        SELECT pv_users.age, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.age;
}}}

The first insert clause sends the results of the first group by to a Hive table, while the second one sends the results to a hadoop dfs file.

== Inserting into local files ==
In certain situations you would want to write the output into a local file so that you could load it into an Excel spreadsheet. This can be accomplished with the following command:

{{{
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
    SELECT pv_gender_sum.*
    FROM pv_gender_sum;
}}}

== Sampling ==
The sampling clause allows the users to write queries for samples of the data instead of the whole table. Currently the sampling is done on the columns that are specified in the CLUSTERED BY clause of the CREATE TABLE statement. In the following example we choose the 3rd bucket out of the 32 buckets of the pv_gender_sum table:

{{{
    INSERT OVERWRITE TABLE pv_gender_sum_sample
    SELECT pv_gender_sum.*
    FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);
}}}

In general the TABLESAMPLE syntax looks like:

{{{
    TABLESAMPLE(BUCKET x OUT OF y)
}}}

y has to be a multiple or divisor of the number of buckets in that table as specified at table creation time. The buckets chosen are those for which bucket_number modulo y is equal to x. So in the above example, the following TABLESAMPLE clause

{{{
    TABLESAMPLE(BUCKET 3 OUT OF 16)
}}}

would pick out the 3rd and 19th buckets (the buckets are numbered starting from 0).
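To make that concrete, a small sketch (assuming the pv_gender_sum table from the example above; this query is not in the original text) that scans only those two buckets:

{{{
    SELECT pv_gender_sum.*
    FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 16);
}}}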
On the other hand, the TABLESAMPLE clause

{{{
    TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)
}}}

would pick out half of the 3rd bucket.

== Union all ==
The language also supports union all. E.g., suppose there are two different tables that track which user has published a video and which user has published a comment; the following query joins the results of a union all with the user table to create a single annotated stream for all the video publishing and comment publishing events:

{{{
    INSERT OVERWRITE TABLE actions_users
    SELECT u.id, actions.date
    FROM (
        SELECT av.uid AS uid
        FROM action_video av
        WHERE av.date = '2008-06-03'

        UNION ALL

        SELECT ac.uid AS uid
        FROM action_comment ac
        WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON (u.id = actions.uid);
}}}

== Array Operations ==
Array columns in tables can currently only be created programmatically. We will be extending this soon to be available as part of the create table statement. For the purpose of the current example, assume that pv.friends is of type array<INT>, i.e. an array of integers. The user can get a specific element in the array by its index, as shown in the following command:

{{{
    SELECT pv.friends[2]
    FROM page_views pv;
}}}

The select expression gets the third item in the pv.friends array, since array elements are indexed starting from 0.

The user can also get the length of the array using the size function, as shown below:

{{{
    SELECT pv.userid, size(pv.friends)
    FROM page_view pv;
}}}
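Array expressions can also appear in predicates. A small hedged sketch (not in the original text; the friend id 12345 is made up) that combines the two functions above to find users whose first listed friend is a given id:

{{{
    SELECT pv.userid
    FROM page_view pv
    WHERE size(pv.friends) > 0 AND pv.friends[0] = 12345;
}}}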
== Map(Associative Arrays) Operations ==
Maps provide collections similar to associative arrays. Such structures can currently only be created programmatically. We will be extending this soon. For the purpose of the current example, assume that pv.properties is of type map<STRING, STRING>, i.e. an associative array from strings to strings. Accordingly, the following query:

{{{
    INSERT OVERWRITE TABLE page_views_map
    SELECT pv.userid, pv.properties['page_type']
    FROM page_views pv;
}}}

can be used to select the 'page_type' property from the page_views table.

Similar to arrays, the size function can also be used to get the number of elements in a map, as shown in the following query:

{{{
    SELECT size(pv.properties)
    FROM page_view pv;
}}}

== Custom map/reduce scripts ==
Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive language. E.g., in order to run a custom mapper script (map_script) and a custom reducer script (reduce_script), the user can issue the following command, which uses the TRANSFORM clause to embed the mapper and the reducer scripts.

Note that columns will be transformed to string and delimited by TAB before feeding to the user script, and the standard output of the user script will be treated as TAB-separated string columns. User scripts can output debug information to standard error, which will be shown on the task detail page on hadoop.

{{{
    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script'
        AS dt, uid
        CLUSTER BY dt) map_output

    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.dt, map_output.uid
        USING 'reduce_script'
        AS date, count;
}}}

Sample map script (weekday_mapper.py):

{{{
import sys
import datetime

# Columns arrive TAB-separated on standard input, one row per line.
for line in sys.stdin:
    line = line.strip()
    userid, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    # Emit the output columns TAB-separated, since Hive treats the
    # script's standard output as TAB-separated string columns.
    print '\t'.join([userid, str(weekday)])
}}}

Of course, both MAP and REDUCE are "syntactic sugar" for the more general select transform. The inner query could also have been written as such:

{{{
    SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS dt, uid CLUSTER BY dt FROM pv_users;
}}}
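The page does not show reduce_script itself, so the following Python sketch is a hypothetical companion to weekday_mapper.py, not part of the original tutorial. It assumes the CLUSTER BY dt arrangement above, i.e. that all rows sharing a dt value arrive contiguously on standard input, and emits one (date, count) row per group:

{{{
import sys

# Hypothetical reduce_script: rows arrive as TAB-separated (dt, uid)
# pairs, clustered by dt, so each dt group is contiguous.
last_dt = None
count = 0

for line in sys.stdin:
    dt, uid = line.strip().split('\t')
    if last_dt is not None and dt != last_dt:
        # dt changed: emit the count for the previous group.
        print '\t'.join([last_dt, str(count)])
        count = 0
    last_dt = dt
    count += 1

# Emit the final group, if any input was seen.
if last_dt is not None:
    print '\t'.join([last_dt, str(count)])
}}}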
Schema-less map/reduce: if there is no "AS" clause after "USING map_script", Hive assumes the output of the script contains two parts: a key, which is everything before the first tab, and a value, which is the rest after the first tab. Note that this is different from specifying "AS key, value", because in that case the value will only contain the portion between the first tab and the second tab if there are multiple tabs.

In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map output. The user still needs to know the reduce output schema, because that has to match what is in the table that we are inserting to.

{{{
    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script'
        CLUSTER BY key) map_output

    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.key, map_output.value
        USING 'reduce_script'
        AS date, count;
}}}

Distribute By and Sort By: instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of the sort columns, but that is not required.

{{{
    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script'
        AS c1, c2, c3
        DISTRIBUTE BY c2
        SORT BY c2, c1) map_output

    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.c1, map_output.c2, map_output.c3
        USING 'reduce_script'
        AS date, count;
}}}
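For comparison (a sketch, not in the original page): CLUSTER BY x is shorthand for DISTRIBUTE BY x SORT BY x on the same column, so the two forms below route and order the map output identically:

{{{
    SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS c1, c2 CLUSTER BY c1 FROM pv_users;
    SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS c1, c2 DISTRIBUTE BY c1 SORT BY c1 FROM pv_users;
}}}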
== Co groups ==
Amongst users of map/reduce, cogroup is a fairly common operation wherein the data from multiple tables is sent to a custom reducer such that the rows are grouped by the values of certain columns of the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be achieved in the Hive query language in the following way. Suppose we wanted to cogroup the rows from the action_video and action_comment tables on the uid column and send them to the 'reduce_script' custom reducer; the following syntax can be used:

{{{
    FROM (
        FROM (
            FROM action_video av
            SELECT av.uid AS uid, av.id AS id, av.date AS date

            UNION ALL

            FROM action_comment ac
            SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
        ) union_actions
        SELECT union_actions.uid, union_actions.id, union_actions.date
        CLUSTER BY union_actions.uid) map

    INSERT OVERWRITE TABLE actions_reduced
        SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);
}}}

== Altering Tables ==
To rename an existing table to a new name (if a table with the new name already exists, an error is returned):

{{{
    ALTER TABLE old_table_name RENAME TO new_table_name;
}}}

To rename the columns of an existing table (be sure to use the same column types, and to include an entry for each preexisting column):

{{{
    ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);
}}}

To add columns to an existing table:

{{{
    ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');
}}}

Note that a change in the schema (such as adding columns) preserves the schema for the old partitions of the table, in case it is a partitioned table. All queries that access these columns and run over the old partitions implicitly return a null value or the specified default values for these columns.

In later versions, we can make the behavior of assuming certain values, as opposed to throwing an error when the column is not found in a particular partition, configurable.

== Dropping Tables and Partitions ==
Dropping tables is fairly trivial. A drop on the table would implicitly drop any indexes (this is a future feature) that would have been built on the table. The associated command is:

{{{
    DROP TABLE pv_users;
}}}

To drop a partition, alter the table:

{{{
    ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')
}}}

'''Note that any data for this table or partitions will be dropped and may not be recoverable.'''
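Conversely, a partition can also be created explicitly before any data is loaded into it. A hedged sketch (ADD PARTITION is assumed here and is not covered by the original page):

{{{
    ALTER TABLE pv_users ADD PARTITION (ds='2008-08-09');
}}}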