hadoop-hive-dev mailing list archives

From "Patrick Angeles (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1027) Create UDFs for XPath expression evaluation
Date Thu, 07 Jan 2010 22:18:54 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797811#action_12797811 ]

Patrick Angeles commented on HIVE-1027:

> Thanks for the detailed explanations. It seems we are supporting XPath 1.0 here. When
> you say "xpath() returns multiple nodes(list)", do you mean it returns a serialized
> XML string representing the list of nodes, such as <a>a1</a><a>a2</a>...? In this
> case, do you have a test case for composing xpath() functions? For example, a subquery
> returns an XML string from the result of xpath() and the outer query takes that input
> to another xpath*() function?

No, xpath() always returns a Hive array of strings. If the expression results in a non-text
value (e.g., another XML node), the function returns an empty array. So there are really
only two uses for xpath(): to get a list of node text values or to get a list of attribute
values. For example:

> select xpath('<a><b>b1</b><b>b2</b></a>','a/*') from src limit 1 ;
> select xpath('<a><b>b1</b><b>b2</b></a>','a/*/text()') from src limit 1 ;   // note the text() at the end of the expression
> select xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>','//@id') from src limit 1 ;
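
To illustrate the three queries above outside of Hive, here is a minimal Java sketch built on javax.xml.xpath. It is not the code from the attached patch: the class XPathListSketch and the method xpathList are hypothetical names, and skipping nodes whose getNodeValue() is null is an assumption about how the empty-array behavior could come about.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not taken from the HIVE-1027 patch.
class XPathListSketch {

    // Evaluates expr against xml as a node set and collects the values of
    // the matching text and attribute nodes, in document order.
    static List<String> xpathList(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getNodeValue() is null for element nodes, so they are skipped
                // (assumed to mirror the "empty array for non-text values" rule).
                String value = nodes.item(i).getNodeValue();
                if (value != null) {
                    values.add(value);
                }
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XML or XPath expression", e);
        }
    }

    public static void main(String[] args) {
        // Element nodes only: prints []
        System.out.println(xpathList("<a><b>b1</b><b>b2</b></a>", "a/*"));
        // Text nodes: prints [b1, b2]
        System.out.println(xpathList("<a><b>b1</b><b>b2</b></a>", "a/*/text()"));
        // Attribute nodes: prints [foo, bar]
        System.out.println(xpathList("<a><b id=\"foo\">b1</b><b id=\"bar\">b2</b></a>", "//@id"));
    }
}
```

The sketch re-parses the XML string on every call, which keeps it short but is not how one would want to do it per row in a UDF.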

This behavior can be changed, but I feel that going down the path of returning nested results
is suboptimal. I'm open to ideas, however.

> For (4) I'm not sure whether we should interpret an empty list as an empty string, etc.
> We can definitely define the mapping from the XML model to the relational model this
> way, but it doesn't distinguish the case where the xpath_string() result is an empty
> list from the case where it is a single node whose value is empty (e.g., <a/> vs. no
> <a> element).
Agreed. Unfortunately, the Java XPath API on which this is built returns an empty string
in both cases. I can change it internally so that it queries for a node instead of a string
and then extracts the string from the node. I suspect this is less performant, but I have
no measurements to back this up.
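
A minimal sketch of the difference, using javax.xml.xpath directly. The class EmptyVsMissingSketch and its methods asString and viaNode are hypothetical names for illustration; this is not the UDF code, and it says nothing about the relative performance of the two approaches.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not taken from the HIVE-1027 patch.
class EmptyVsMissingSketch {

    // String evaluation: yields "" both for an empty node and for no match.
    static String asString(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XML or XPath expression", e);
        }
    }

    // Node evaluation: yields null when nothing matches, otherwise the
    // text content of the first matching node (possibly "").
    static String viaNode(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(
                    expr, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            return node == null ? null : node.getTextContent();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XML or XPath expression", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + asString("<a/>", "a") + "]"); // prints []
        System.out.println("[" + asString("<b/>", "a") + "]"); // prints []
        System.out.println("[" + viaNode("<a/>", "a") + "]");  // prints []
        System.out.println("[" + viaNode("<b/>", "a") + "]");  // prints [null]
    }
}
```

With node evaluation, a missing <a> element could be surfaced as NULL in Hive while <a/> stays an empty string, which is the distinction raised in the quoted question.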

> Also, all this information is better exposed to the wider community (not only developers)
> as well. Can you also add all of this to Hive's wiki page?
Absolutely... I will update the Hive Wiki once this is committed.

> Create UDFs for XPath expression evaluation
> -------------------------------------------
>                 Key: HIVE-1027
>                 URL: https://issues.apache.org/jira/browse/HIVE-1027
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Patrick Angeles
>            Assignee: Patrick Angeles
>            Priority: Minor
>         Attachments: hive-1027.patch, udf_xpath.patch
> Create UDFs for evaluating XPath expressions against XML documents.
> Examples:
> > SELECT xpath_double ('<a><b class="odd">1</b><b class="even">2</b><b class="odd">4</b><c>8</c></a>', 'sum(a/b[@class="odd"])') FROM src LIMIT 1 ;
> 5.0
> > SELECT xpath_string ('<a><b>b1</b><b>b2</b></a>', 'a/b[2]') FROM src LIMIT 1 ;
> b2
> > SELECT xpath ('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>', 'a/c/text()') FROM src LIMIT 1 ;
> ["c1","c2"]
> Included functions are: xpath_short, xpath_int, xpath_long, xpath_float, xpath_double/xpath_number, xpath_string, xpath

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
