hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sachin Goyal <>
Subject Re: how to load json with nested array into hive?
Date Mon, 23 Jun 2014 16:57:09 GMT

You can also use hive-json-schema to automate Hive schema generation from JSON:

From: Nitin Pawar <<>>
Reply-To: "<>" <<>>
Date: Monday, June 23, 2014 at 2:25 AM
To: "<>" <<>>
Subject: Re: how to load json with nested array into hive?

I think you can just take a look at jsonserde

It does take care of nested json documents. (though you will need to know entire json structure

Here is example of using it

On Mon, Jun 23, 2014 at 2:28 PM, Christian Link <<>>
Hi Jerome,

thanks...I've already found "Brickhouse" and the Hive UDFs, but it didn't help.

Today I'll try again to process the json file after going through all my mails...maybe I'll
find a solution.


On Fri, Jun 20, 2014 at 7:16 PM, Jerome Banks <<>>
   Sorry to spam this newsgroup, and this is not a commercial endorsement, but check out the
Hive UDFs in the Brickhouse project ( ) (

You can convert arbitrary complex Hive structures to an from json with it's to_json and from_json
UDF's. See the blog posting for an explanation.

-- jerome

On Fri, Jun 20, 2014 at 8:26 AM, Christian Link <<>>

I'm very, very new to Hadoop, Hive, etc. and I have to import data into hive tables.

Environment: Amazon EMR, S3, etc.

The input file is on S3 and I copied it into my HDFS.

1. flat table with one column and loaded data into it:

  CREATE TABLE mdmp_raw_data (json_record STRING);
  LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO TABLE `mdmp_raw_data`;
That worked, I can access some data, like this:

SELECT d.carrier, d.language,
  FROM mdmp_raw_data a LATERAL VIEW json_tuple(, 'requestTimestamp', 'context') b  
 AS requestTimestamp, context
  LATERAL VIEW json_tuple(b.context, 'locale') c AS locale
  LATERAL VIEW json_tuple(c.locale, 'carrier', 'language', 'country') d AS carrier, language,
  LIMIT 1;

Result: o2 - de Deutsch Deutschland

I can also select the array at once:

SELECT b.requestTimestamp, b.batch
  FROM mdmp_raw_data a
  LATERAL VIEW json_tuple(, 'requestTimestamp', 'batch') b AS requestTimestamp, batch
  LIMIT 1;
This will give me:

Doge Comments","userId":"doge96514016ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Much
joy."}}, ...]

This "batch" may contain n events will a structure like above.

I want to put all events in a table where each "element" will be stored in a unique column:
timestamp, requestId, sessionId, event, userId, action, context, properties

2. explode the "batch" I read a lot about SerDe, etc. - but I don't get it.

- I tried to create a table with an array and load the data into it - several errors
use explode in query but it doesn't accept "batch" as array
- integrated several SerDes but get things like "unknown function jspilt"
- I'm lost in too many documents, howtos, etc. and could need some advices...

Thank you in advance!

Best, Chris

Nitin Pawar

View raw message