hive-user mailing list archives

From Himanshu Vijay <himansh...@gmail.com>
Subject Which approach for handling heavily nested JSONs ?
Date Sat, 06 Apr 2013 00:52:14 GMT
Hi,

I have been dealing with some heavily nested and complex JSON data. It has
all sorts of combinations like:
Struct<....results: <array<struct<...scores: <array<int>>>>>,...>

I wanted to know which approach you find better: using the SerDe or using
the UDFs. In my opinion the two approaches can be compared in the following
way. Please correct me if your experience has been different in some regard:


Approach 1: UDF based
Approach 2: SerDe based

Table schema
  Approach 1: Very small.
  Approach 2: Big. Mirrors the JSON structure and grows in direct
              proportion to how complex/heavily nested the JSON is.

Size/verbosity of query
  Approach 1: More, especially if using 'lateral view json_tuple(..)'.
  Approach 2: Less.

Maintenance effort when the JSON structure changes
  Approach 1: Update query (optional)*
  Approach 2: Update schema. Update query (optional)*

Processing heavily nested/complex JSON
  Approach 1: May need to write a couple of custom UDFs, but all in all
              possible.
  Approach 2: The SerDes available out there need patching as they are
              not mature.**

   *  Only if the field that was added, or whose position changed, needs
      to be queried.
   ** Do you know any JSON SerDes that are robust enough to process complex
      JSONs ?
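
To make the UDF-based column of the comparison concrete, here is a minimal HiveQL sketch. The table name raw_events, the column doc, and the JSON shape in the comment are assumptions made up for illustration; they are not from the data described above. It uses only the built-in get_json_object and json_tuple functions.

```sql
-- Hypothetical table: one raw JSON document per row, stored as a string.
CREATE TABLE raw_events (doc STRING);

-- Assumed document shape (illustrative only):
-- {"id": "a1", "status": "ok",
--  "results": [{"name": "x", "scores": [1, 2, 3]}]}

SELECT
  t.id,
  t.status,
  -- Nested values are reached with a JSON path on the raw string.
  get_json_object(r.doc, '$.results[0].name')      AS first_result_name,
  get_json_object(r.doc, '$.results[0].scores[1]') AS second_score
FROM raw_events r
-- json_tuple pulls several top-level keys in a single pass.
LATERAL VIEW json_tuple(r.doc, 'id', 'status') t AS id, status;
```

The schema stays a single string column, but every nested field costs one more path expression in the query, and the extracted values come back as strings that may need casting.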

I find more examples of the UDF-based approach. If the SerDes were less
error-prone, the queries would end up being really succinct.
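
As a rough sketch of what the SerDe-based approach might look like, the HiveQL below declares the nesting in the table schema. The table name events_serde and the field names are hypothetical. The SerDe class shown is the HCatalog JSON SerDe, which is an assumption about the setup: its package name depends on the Hive/HCatalog version, its jar has to be on the classpath, and a third-party JSON SerDe could be substituted.

```sql
-- Hypothetical schema mirroring an assumed JSON shape:
-- {"id": "a1", "results": [{"name": "x", "scores": [1, 2, 3]}]}
CREATE TABLE events_serde (
  id      STRING,
  results ARRAY<STRUCT<name: STRING, scores: ARRAY<INT>>>
)
-- Assumed SerDe class; adjust to the SerDe and version actually installed.
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

-- Nested fields are addressed directly; explode() flattens the arrays.
SELECT e.id, r.name, s AS score
FROM events_serde e
LATERAL VIEW explode(e.results) rs AS r
LATERAL VIEW explode(r.scores) sc AS s;
```

The query is shorter than the UDF equivalent and the values arrive typed, at the price of a schema that has to track every change in the JSON structure.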

Regards,
Himanshu
