Return-Path: Delivered-To: apmail-hadoop-common-commits-archive@www.apache.org Received: (qmail 28630 invoked from network); 15 Mar 2010 15:39:30 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 15 Mar 2010 15:39:30 -0000 Received: (qmail 50402 invoked by uid 500); 15 Mar 2010 15:38:43 -0000 Delivered-To: apmail-hadoop-common-commits-archive@hadoop.apache.org Received: (qmail 50279 invoked by uid 500); 15 Mar 2010 15:38:42 -0000 Mailing-List: contact common-commits-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: common-dev@hadoop.apache.org Delivered-To: mailing list common-commits@hadoop.apache.org Received: (qmail 50272 invoked by uid 500); 15 Mar 2010 15:38:42 -0000 Delivered-To: apmail-hadoop-core-commits@hadoop.apache.org Received: (qmail 50269 invoked by uid 99); 15 Mar 2010 15:38:42 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 15 Mar 2010 15:38:42 +0000 X-ASF-Spam-Status: No, hits=-1014.0 required=10.0 tests=ALL_TRUSTED,AWL X-Spam-Check-By: apache.org Received: from [140.211.11.130] (HELO eos.apache.org) (140.211.11.130) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 15 Mar 2010 15:38:41 +0000 Received: from eos.apache.org (localhost [127.0.0.1]) by eos.apache.org (Postfix) with ESMTP id 4182717621; Mon, 15 Mar 2010 15:38:21 +0000 (GMT) MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable From: Apache Wiki To: Apache Wiki Date: Mon, 15 Mar 2010 15:38:21 -0000 Message-ID: <20100315153821.14215.89558@eos.apache.org> Subject: =?utf-8?q?=5BHadoop_Wiki=5D_Update_of_=22Hive/HiveUDFGuide=22_by_patricka?= =?utf-8?q?ngeles?= Dear Wiki user, You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for ch= ange notification. The "Hive/HiveUDFGuide" page has been changed by patrickangeles. http://wiki.apache.org/hadoop/Hive/HiveUDFGuide -------------------------------------------------- New page: Documentation for Built-In User-Defined Functions =3D=3D UDFs =3D=3D =3D=3D=3D xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_dou= ble, xpath_number, xpath_string =3D=3D=3D * Functions for parsing XML data using XPath expressions. * Since version: 0.6.0 =3D=3D=3D=3D Overview =3D=3D=3D=3D The ''xpath'' family of UDFs are wrappers around the Java XPath library {{{= javax.xml.xpath}}} provided by the JDK. The library is based on the XPath 1= .0 specification. Please refer to http://java.sun.com/javase/6/docs/api/jav= ax/xml/xpath/package-summary.html for detailed information on the Java XPat= h library. All functions follow the form: {{{xpath_*(xml_string, xpath_expression_stri= ng)}}}. The XPath expression string is compiled and cached. It is reused if= the expression in the next input row matches the previous. Otherwise, it i= s recompiled. So, the xml string is always parsed for every input row, but = the xpath expression is precompiled and reused for the vast majority of use= cases. Backward axes are supported. For example: {{{ > select xpath ('','/descend= ant::c/ancestor::b/@id') from t1 limit 1 ; ["1","2"] }}} Each function returns a specific Hive type given the XPath expression: * {{{xpath}}} returns a Hive array of strings. * {{{xpath_string}}} returns a string. * {{{xpath_boolean}}} returns a boolean. * {{{xpath_short}}} returns a short integer. * {{{xpath_int}}} returns an integer. * {{{xpath_long}}} returns a long integer. * {{{xpath_float}}} returns a floating point number. * {{{xpath_double,xpath_number}}} returns a double-precision floating poin= t number ({{{xpath_number}}} is an alias for {{{xpath_double}}}). The UDFs are schema agnostic - no XML validation is performed. However, mal= formed xml (e.g., {{{1}}}) will result in a runtime exceptio= n being thrown. Following are specifics on each xpath UDF variant. =3D=3D=3D=3D xpath =3D=3D=3D=3D The {{{xpath()}}} function always returns a hive array of strings. If the e= xpression results in a non-text value (e.g., another xml node) the function= will return an empty array. There are 2 primary uses for this function: to= get a list of node text values or to get a list of attribute values. Examples: Non-matching XPath expression: {{{ > select xpath('b1b2','a/*') from src limit 1 ; [] }}} Get a list of node text values: {{{ > select xpath('b1b2','a/*/text()') from src limit 1 ; ["b1","b2"] }}} Get a list of values for attribute 'id': {{{ > select xpath('b1b2','//@id') f= rom src limit 1 ; = ["foo","bar"] }}} Get a list of node texts for nodes where the 'class' attribute equals 'bb': {{{ > SELECT xpath ('b1b2b3c1c2', 'a/*[@class=3D"bb"]/text()') FROM src LIMIT 1 ; ["b1","c1"] }}} =3D=3D=3D=3D xpath_string =3D=3D=3D=3D The {{{xpath_string()}}} function returns the text of the first matching no= de. Get the text for node 'a/b': {{{ > SELECT xpath_string ('bbcc', 'a/b') FROM src LIMIT 1= ; bb }}} Get the text for node 'a'. Because 'a' has children nodes with text, the re= sult is a composite of text from the children. {{{ > SELECT xpath_string ('bbcc', 'a') FROM src LIMIT 1 ; bbcc }}} Non-matching expression returns an empty string: {{{ > SELECT xpath_string ('bbcc', 'a/d') FROM src LIMIT 1= ; }}} Gets the text of the first node that matches '//b': {{{ > SELECT xpath_string ('b1b2', '//b') FROM src LIMIT 1= ; b1 }}} Gets the second matching node: {{{ > SELECT xpath_string ('b1b2', 'a/b[2]') FROM src LIMI= T 1 ; b2 }}} Gets the text from the first node that has an attribute 'id' with value 'b_= 2': {{{ > SELECT xpath_string ('b1b2', 'a/b[@id=3D"= b_2"]') FROM src LIMIT 1 ; b2 }}} =3D=3D=3D=3D xpath_boolean =3D=3D=3D=3D Returns true if the XPath expression evaluates to true, or if a matching no= de is found. Match found: {{{ > SELECT xpath_boolean ('b', 'a/b') FROM src LIMIT 1 ; true }}} No match found: {{{ > SELECT xpath_boolean ('b', 'a/c') FROM src LIMIT 1 ; false }}} Match found: {{{ > SELECT xpath_boolean ('b', 'a/b =3D "b"') FROM src LIMIT 1 ; true }}} No match found: {{{ > SELECT xpath_boolean ('10', 'a/b < 10') FROM src LIMIT 1 ; false }}} =3D=3D=3D=3D xpath_short, xpath_int, xpath_long =3D=3D=3D=3D These functions return an integer numeric value, or the value zero if no ma= tch is found, or a match is found but the value is non-numeric. Mathematical operations are supported. In cases where the value overflows t= he return type, then the maximum value for the type is returned. No match: {{{ > SELECT xpath_int ('b', 'a =3D 10') FROM src LIMIT 1 ; 0 }}} Non-numeric match: {{{ > SELECT xpath_int ('this is not a number', 'a') FROM src LIMIT 1 ; 0 > SELECT xpath_int ('this 2 is not a number', 'a') FROM src LIMIT 1 ; 0 }}} Adding values: {{{ > SELECT xpath_int ('1248', 'sum(a/*)') FROM src LIMIT 1 ; 15 > SELECT xpath_int ('1248', 'sum(a/b)') FROM src LIMIT 1 ; 7 > SELECT xpath_int ('1248', 'sum(a/b[@class=3D"odd"])') FROM src LIMIT 1= ; 5 }}} Overflow: {{{ > SELECT xpath_int ('200000000040000000000', 'a/b * a/= c') FROM src LIMIT 1 ; 2147483647 }}} =3D=3D=3D=3D xpath_float, xpath_double, xpath_number =3D=3D=3D=3D Similar to xpath_short, xpath_int and xpath_long but with floating point se= mantics. Non-matches result in zero. However, non-numeric matches result in NaN. Note that {{{xpath_number()}}} is an ali= as for {{{xpath_double()}}}. No match: {{{ > SELECT xpath_double ('b', 'a =3D 10') FROM src LIMIT 1 ; 0.0 }}} Non-numeric match: {{{ > SELECT xpath_double ('this is not a number', 'a') FROM src LIMIT 1= ; NaN }}} A very large number: {{{ SELECT xpath_double ('200000000040000000000', 'a/b * a= /c') FROM src LIMIT 1 ; 8.0E19 }}} =3D UDAFs =3D =3D UDTFs =3D