hive-user mailing list archives

From Christian Link <christian.l...@mdmp.com>
Subject How to load json data with nested arrays into hive?
Date Fri, 20 Jun 2014 15:19:51 GMT
Hi,


I'm very, very new to Hadoop, Hive, etc., and I have to import data into
Hive tables.

Environment: Amazon EMR, S3, etc.

The input file is on S3 and I copied it into HDFS:

  hadoop distcp s3n://mdmp-api-dump/1403181319.json hdfs:///input-api

*Example:*

{"requestTimestamp":"2014-06-19T10:52:26+02:00","context":{"libraryVersion":"0.6.13","telephony":{"radio":"gsm","carrier":"o2
-
de"},"wifi":{"connected":true,"available":true},"location":{},"locale":{"carrier":"o2
-
de","language":"Deutsch","country":"Deutschland"},"library":"analytics-android","device":{"brand":"htc","model":"HTC
One
S","sdk":16,"release":"4.1.1","manufacturer":"HTC"},"display":{"density":1.5,"width":540,"height":960},"build":{"name":"1.0","code":1},"integrations":{"Tapstream":false,"Amplitude":false,"Localytics":false,"Flurry":false,"Countly":false,"Bugsnag":false,"Quantcast":false,"Crittercism":false,"Google
Analytics":false,"Mixpanel":false}},"batch":[{"timestamp":"2014-06-19T10:52:18+02:00","requestId":"43e9de90-152a-4649-8e1d-eb159489cc70","sessionId":"75cd18db8a364c2","event":"TEST
Doge App
Usage","userId":"doge019mr8mf4ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{}},{"action":"identify","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"timestamp":"2014-06-19T10:52:18+02:00","traits":{"email":"
doge@mdmp.com","name":"Doge"},"requestId":"20c56610-3135-4c07-9975-d930492f3ff1","sessionId":"75cd18db8a364c2","userId":"doge019mr8mf4ruffruff"},{"timestamp":"2014-06-19T10:52:18+02:00","requestId":"6b91890e-a67b-4f60-86cf-ec387588fe0a","sessionId":"75cd18db8a364c2","event":"TEST
Doge
Purchase","userId":"doge019mr8mf4ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,90"}},{"timestamp":"2014-06-19T10:52:18+02:00","requestId":"7f306aee-bd37-48f1-85cd-fd7dde412569","sessionId":"75cd18db8a364c2","event":"TEST
Doge
Comments","userId":"doge019mr8mf4ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Such
App!"}},{"timestamp":"2014-06-19T10:52:19+02:00","requestId":"3aa2b69f-2977-4002-a6ff-633a7728c856","sessionId":"75cd18db8a364c2","event":"TEST
Doge
Comments","userId":"doge019mr8mf4ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Such
App!"}},{"timestamp":"2014-06-19T10:52:21+02:00","requestId":"ca706c2d-8055-423a-ac81-c738b6bf4e52","sessionId":"75cd18db8a364c2","event":"TEST
Doge
Purchase","userId":"doge019mr8mf4ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"1,89"}},{"timestamp":"2014-06-19T10:52:21+02:00","requestId":"a1706a40-1581-4bf4-a1b4-f95314338991","sessionId":"75cd18db8a364c2","event":"TEST
Doge
Comments","userId":"doge019mr8mf4ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Such
App!"}},{"timestamp":"2014-06-19T10:52:22+02:00","requestId":"6f0828cd-7dd2-46b3-b64c-248f02da7966","sessionId":"75cd18db8a364c2","event":"TEST
Doge
Comments","userId":"doge019mr8mf4ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff
ruff!"}},{"timestamp":"2014-06-19T10:52:24+02:00","requestId":"04910cc0-42e3-4fcd-a4d2-3c10390e5a50","sessionId":"75cd18db8a364c2","event":"TEST
Doge
Purchase","userId":"doge019mr8mf4ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"1,22"}},{"timestamp":"2014-06-19T10:52:24+02:00","requestId":"b077dbf2-0e01-4b84-a3d3-855ab2cd45c8","sessionId":"75cd18db8a364c2","event":"TEST
Doge
Comments","userId":"doge019mr8mf4ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Such
App!"}},{"timestamp":"2014-06-19T10:52:25+02:00","requestId":"2871f36f-ca2c-4540-a1a7-074ec3156b59","sessionId":"75cd18db8a364c2","event":"TEST
Doge
Comments","userId":"doge019mr8mf4ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff
ruff!"}}],"writeKey":"a8RCFSAVjmT5qyxLKMzt12kcXWOIusvw","action":"import","received_at":"2014-06-19T08:52:29.189+00:00"}

I've created a database in Hive.

I tried several things.

*1. create a flat table with one column and load the data into it:*

  CREATE TABLE mdmp_raw_data (json_record STRING);
  LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO TABLE `mdmp_raw_data`;

That worked, I can access some data, like this:

SELECT d.carrier, d.language, d.country
  FROM mdmp_raw_data a
  LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'context') b AS requestTimestamp, context
  LATERAL VIEW json_tuple(b.context, 'locale') c AS locale
  LATERAL VIEW json_tuple(c.locale, 'carrier', 'language', 'country') d AS carrier, language, country
  LIMIT 1;

Result: o2 - de Deutsch Deutschland

I can also *select* the array at once:

SELECT b.requestTimestamp, b.batch
  FROM mdmp_raw_data a
  LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'batch') b AS requestTimestamp, batch
  LIMIT 1;

This will give me:

 [{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"2ca08247-5542-4cb4-be7e-4a8574fb77a8","sessionId":"f29ec175ca6b7d10","event":"TEST
Doge Comments","userId":"doge96514016ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Much
joy."}}, ...]

This batch may contain n events with a structure like the one above.

I want to put all events into a table where each "element" is stored in its
own column: timestamp, requestId, sessionId, event, userId, action, context,
properties (roughly like the sketch below).
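
Just to make the goal concrete, this is a sketch of what I'm after; the table
name mdmp_events and the column types are my own guesses:

  -- rough sketch of the target table; context/properties would stay JSON strings for now
  CREATE TABLE mdmp_events (
    `timestamp` STRING,
    requestId   STRING,
    sessionId   STRING,
    event       STRING,
    userId      STRING,
    action      STRING,
    context     STRING,
    properties  STRING
  );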

*2. explode the "batch"*

I read a lot about SerDes, etc., but I don't get it.

   - I tried to create a table with an array column and load the data into it -
   several errors
   - I tried to use explode in a query, but it doesn't accept "batch" as an
   array (a rough sketch of this attempt follows the list)
   - I integrated several SerDes, but I get errors like "unknown function jspilt"
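
To give an idea, my explode attempt looked roughly like this (reconstructed
from memory, so it may not be exactly what I ran):

  -- fails, as far as I understand, because json_tuple returns "batch" as a
  -- STRING, while explode() expects an ARRAY (or MAP)
  SELECT ev.single_event
    FROM mdmp_raw_data a
    LATERAL VIEW json_tuple(a.json_record, 'batch') b AS batch
    LATERAL VIEW explode(b.batch) ev AS single_event;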

I'm lost in too many documents, howtos, etc. and could use some advice...

Thank you in advance!

Best, Chris
