hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bobby Rullo <bo...@metaweb.com>
Subject Re: JSON Column Type
Date Sun, 04 Oct 2009 01:44:28 GMT

On Oct 3, 2009, at 1:28 AM, Zheng Shao wrote:

> I got it. You mean TimeSpentQuerying, PageType, TotalRevenue and  
> UserAgent are all UDFs that takes a JSON object and outputs a STRING.
>
Exactly!

> Allowing such a new object type just for UDFs are simpler than  
> supporting a new type in all parts of the system, because we don't  
> need to care about serialization/deserialization.
>

Interesting. So you're saying it's ok to return whatever type from  
UDF.evaluate(), as long as it doesn't get serialized (i.e. as long as  
it's not the final output)?

>
> If we go that route, the final query will look like this. Note that  
> StringToJson takes a STRING and returns a JSON, and  
> TimeSpentQuerying (etc) takes a JSON and returns either STRING or  
> other types that are supported by Hive.
>
> SELECT TimeSpentQuerying(txnjson), PageType(txnjson),  
> TotalRevenue(txnjson) from (SELECT StringToJson(txn) as txnjson FROM  
> txn_logs) tnxjson_logs
> where UserAgent(txnjson) like '%Safari%';
>
> Is this good enough?
>

That's pretty damn good. I like that all I have to do is create one  
more UDF which does the JSON parsing (and maybe JsonToString as well.)

The only thing that I'm concerned about is whether that sub-query is  
going to create an extra map job. Is the planner smart enough to stick  
stick all that work in one map?


Thanks!

bobby

>
> Zheng
>
> On Fri, Oct 2, 2009 at 11:50 PM, Bobby Rullo <bobby@metaweb.com>  
> wrote:
> Zheng,
>
> Thanks for your reply.
>
> I'm not sure what this "get_json_objects" function would do...do you  
> mean it would return a canned set of fields? That's not what I'm  
> interested in. The UDFs that I have in mind dig deep into the JSON  
> and fetch/filter/transform data. I don't want a few canned fields to  
> play with, I want our users to be able to create AdHoc queries on  
> the JSON using any and all of the data inside.
>
> Here's a more realistic example. 'Txn' is a JSON field:
>
> select TimeSpentQuerying(txn), PageType(txn), TotalRevenue(txn) from  
> txn_logs where UserAgent(txn) like '%Safari%'
>
> I have no problem writing the UDFs, I'm just trying to avoid parsing  
> the JSON in each one of them - in other words, I want the input to  
> the UDF's to not be a String (or Text) which I need to parse into  
> JSON, but rather a JSON object.
>
> Is that possible? Can I create new field types? And if so, where do  
> I register them?
>
> My fantasy is to be able to do this:
>
> CREATE TABLE txn_logs (tid String, txn JSON);
>
> Thanks,
>
> Bobby
>
>
>
> On Oct 2, 2009, at 10:54 PM, Zheng Shao wrote:
>
>> We have 2 example serdes, one for text data (regexserde), one for
>> binary data (thriftserde).
>>
>> But the simplest solution for this is to add a udf get_json_objects
>> that returns a bunch of fields in an array. Then we can use sub query
>> to extract the array elements as individual elements, until we have
>> support for common expression extraction.
>>
>>
>> There are a few example udf in the ctrib package illustrating the way
>> to accept variable number of arguments.
>>
>> Let us know how it goes.
>>
>> Zheng
>>
>>
>>
>> On 10/2/09, Bobby Rullo <bobby@metaweb.com> wrote:
>>> Hi there,
>>>
>>> I want to create a new JSON Field/Column type. I know there exists
>>> get_json_object(), but the things is I want to multiple JSON
>>> operations in a single select statement and don't want to incur the
>>> cost of parsing the JSON over and over again, because our json
>>> structures are quite large.
>>>
>>> Our schema is two fields - the first is a string of some sort used  
>>> as
>>> an id, and the second is a huge JSON structure. I want to be able
>>> write sql like the following:
>>>
>>> select Foo(json), Bar(json), Baz(json) where Quux(json) like  
>>> '%hello%';
>>>
>>> Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json.
>>> Right now what I'm doing is storing the JSON as a string and in each
>>> of those UDF's I am de-serializing the json, and returning another
>>> string.
>>>
>>> Ideally I'd want to be able to return JSON objects from these  
>>> UDF's so
>>> I could compose my functions like "Foo(Bar(Baz(json))"
>>>
>>> I'm thinking I need to create a SerDe but SerDe's seem to work at  
>>> the
>>> table level, not at the field level (in other words, I just need a
>>> custom field, not a entire custom table type...or do I!?)
>>>
>>> Also, if there's any docs on writing custom SerDe's please let me
>>> know...I haven't been able to find much.
>>>
>>> Thanks!
>>>
>>> Bobby
>>>
>>
>> -- 
>> Sent from Gmail for mobile | mobile.google.com
>>
>> Yours,
>> Zheng
>
>
>
>
> -- 
> Yours,
> Zheng


Mime
View raw message