hive-user mailing list archives

From Swagatika Tripathy <swagatikat...@gmail.com>
Subject Re: how to load json with nested array into hive?
Date Mon, 23 Jun 2014 18:51:55 GMT
Hi,
Use the 1.9.3 JSON SerDe jar with dependencies. It's the latest one, I suppose.

Thanks
Swagatika
On Jun 23, 2014 11:57 PM, "Roberto Congiu" <roberto.congiu@openx.com> wrote:

> Hi,
> 1.1.4 is an oldish version of the JSON SerDe; have you tried the most
> recent one from the master branch?
>
>
> On Mon, Jun 23, 2014 at 10:23 AM, Christian Link <christian.link@mdmp.com>
> wrote:
>
>> Hi,
>>
>> thanks...but I need to sort things out with ONE SerDe/strategy...
>> I've started with André's idea: use Roberto Congiu's SerDe and André's
>> template to create a table with the right schema and load the data
>> afterwards.
>>
>> But it's not completely working...
>>
>> I did the following (sorry for spamming...):
>>
>> 1. create table and load data
>>
>> -- create database (if not exists)
>> CREATE DATABASE IF NOT EXISTS mdmp_api_dump;
>>
>> -- connect to database;
>> USE mdmp_api_dump;
>>
>> -- add SerDE for json processing
>> ADD JAR /home/hadoop/lib/hive/json-serde-1.1.4-jar-with-dependencies.jar;
>>
>> -- drop old raw data
>> DROP TABLE IF EXISTS mdmp_raw_data;
>>
>> -- create raw data table
>> CREATE TABLE mdmp_raw_data (
>>   action string,
>>   batch array<
>>           struct<
>>             timestamp:string,
>>             traits:map<string,string>,
>>             requestId:string,
>>             sessionId:string,
>>             event:string,
>>             userId:string,
>>             action:string,
>>             context:map<string,string>,
>>             properties:map<string,string>
>>
>>           >
>>         >,
>>   context struct<
>>             build:map<string,string>,
>>             device:struct<
>>                      brand:string,
>>                      manufacturer:string,
>>                      model:string,
>>                      release:string,
>>                      sdk:int
>>                    >,
>>             display:struct<
>>                       density:double,
>>                       height:int,
>>                       width:int
>>                     >,
>>             integrations:map<string,boolean>,
>>             library:string,
>>             libraryVersion:string,
>>             locale:map<string,string>,
>>             location:map<string,string>,
>>             telephony:map<string,string>,
>>             wifi:map<string,boolean>
>>           >,
>>   received_at string,
>>   requestTimestamp string,
>>   writeKey string
>> )
>> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
>> STORED AS TEXTFILE;
>>
>> -- load data
>> LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO TABLE
>> `mdmp_raw_data`;
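>>
>> A quick sanity check against the table above (just a sketch; size()
>> should return the number of events per batch, so NULL or 0 here would
>> already point at a parsing problem):
>>
>> SELECT action, received_at, size(batch) AS batch_size
>> FROM mdmp_raw_data
>> LIMIT 5;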
>>
>> 2. run query against the "raw data" and create "formatted table":
>>
>> ADD JAR /home/hadoop/lib/hive/json-serde-1.1.4-jar-with-dependencies.jar;
>>
>> USE mdmp_api_dump;
>>
>> DROP TABLE IF EXISTS mdmp_api_data;
>>
>> CREATE TABLE mdmp_api_data AS
>> SELECT DISTINCT
>>   a.action,
>>   a.received_at,
>>   a.requestTimestamp,
>>   a.writeKey,
>>   a.context.device.brand as brand,
>>   a.context.device.manufacturer as manufacturer,
>>   a.context.device.model as model,
>>   a.context.device.release as release,
>>   a.context.device.sdk as sdk,
>> --  a.context.display.density as density,
>>   a.context.display.height as height,
>>   a.context.display.width as width,
>>   a.context.telephony['radio'] as tel_radio,
>>   a.context.telephony['carrier'] as tel_carrier,
>>   a.context.wifi['connected'] as wifi_connected,
>>   a.context.wifi['available'] as wifi_available,
>>    a.context.locale['carrier'] as loce_carrier,
>>   a.context.locale['language'] as loce_language,
>>   a.context.locale['country'] as loce_country,
>>   a.context.integrations['Tapstream'] as int_tapstream,
>>   a.context.integrations['Amplitude'] as int_amplitude,
>>   a.context.integrations['Localytics'] as int_localytics,
>>   a.context.integrations['Flurry'] as int_flurry,
>>   a.context.integrations['Countly'] as int_countly,
>>   a.context.integrations['Quantcast'] as int_quantcast,
>>   a.context.integrations['Crittercism'] as int_crittercism,
>>   a.context.integrations['Google Analytics'] as int_googleanalytics,
>>   a.context.integrations['Mixpanel'] as int_mixpanel,
>>   b.batch.action AS b_action,
>>   b.batch.context,
>>   b.batch.event,
>>   b.batch.properties,
>>   b.batch.requestId,
>>   b.batch.sessionId,
>>   b.batch.timestamp,
>>   b.batch.traits,
>>   b.batch.userId
>> FROM mdmp_raw_data a
>> LATERAL VIEW explode(a.batch) b AS batch;
>>
>> So far so good (besides a silly double/int bug in the outdated SerDe),
>> or so I thought.
>>
>> But it turned out that some fields are NULL, in all records.
>>
>> Affected fields are:
>>   b.batch.event,
>>   b.batch.requestId,
>>   b.batch.sessionId,
>>   b.batch.userId
>>
>> I can see the values in the JSON file, but neither in the "raw table" nor in
>> the final table...that's really strange.
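>>
>> One thing worth double-checking (just a guess on my side): those JSON
>> keys are camelCase ("requestId", "sessionId", "userId"), while Hive
>> lowercases column and struct field names. Newer versions of this SerDe
>> support explicit name mappings via SERDEPROPERTIES; a sketch of how the
>> table definition above could declare them (I'm not sure the mapping
>> reaches fields nested inside the batch structs, so treat it as an
>> assumption):
>>
>> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
>> WITH SERDEPROPERTIES (
>>   "mapping.requestid" = "requestId",
>>   "mapping.sessionid" = "sessionId",
>>   "mapping.userid"    = "userId"
>> )
>> STORED AS TEXTFILE;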
>>
>> An example record:
>> {"requestTimestamp":"2014-06-19T14:25:26+02:00","context":{"libraryVersion":"0.6.13","telephony":{"radio":"gsm","carrier":"o2
>> -
>> de"},"wifi":{"connected":true,"available":true},"location":{},"locale":{"carrier":"o2
>> -
>> de","language":"Deutsch","country":"Deutschland"},"library":"analytics-android","device":{"brand":"htc","model":"HTC
>> One
>> S","sdk":16,"release":"4.1.1","manufacturer":"HTC"},"display":{"density":1.5,"width":540,"height":960},"build":{"name":"1.0","code":1},"integrations":{"Tapstream":false,"Amplitude":false,"Localytics":false,"Flurry":false,"Countly":false,"Bugsnag":false,"Quantcast":false,"Crittercism":false,"Google
>> Analytics":false,"Mixpanel":false}},"batch":[{"timestamp":"2014-06-19T14:25:17+02:00","requestId":"32377337-3f99-4ac5-bfc6-d3654584655b","sessionId":"75cd18db8a364c2","event":"TEST
>> Doge
>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff
>> ruff!"}},{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"fbfd45c9-cf0f-4cb3-955c-85c65220a5bd","sessionId":"75cd18db8a364c2","event":"TEST
>> Doge
>> Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,08"}},{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"3a643b12-64e5-4a7c-b44b-e3e09dbc5b66","sessionId":"75cd18db8a364c2","event":"TEST
>> Doge
>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},{"action":"identify","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"timestamp":"2014-06-19T14:25:19+02:00","traits":{"email":"
>> doges@mdmp.com","name":"Carmelo
>> Doge"},"requestId":"ef2910f4-cd4f-4175-89d0-7d91b35c229f","sessionId":"75cd18db8a364c2","userId":"doge74167705ruffruff"},{"timestamp":"2014-06-19T14:25:19+02:00","requestId":"1676bb06-abee-4135-a206-d57c4a1bc24d","sessionId":"75cd18db8a364c2","event":"TEST
>> Doge App
>> Usage","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{}},{"timestamp":"2014-06-19T14:25:20+02:00","requestId":"66532c8a-c5da-4852-b8b6-04df8f3052d5","sessionId":"75cd18db8a364c2","event":"TEST
>> Doge
>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Many
>> data."}},{"timestamp":"2014-06-19T14:25:21+02:00","requestId":"a1a79d8c-fe58-4567-8dec-a8d1d2ae2713","sessionId":"75cd18db8a364c2","event":"TEST
>> Doge
>> Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,87"}},{"timestamp":"2014-06-19T14:25:21+02:00","requestId":"259209ac-b135-4d5f-bdac-535eccc0400e","sessionId":"75cd18db8a364c2","event":"TEST
>> Doge
>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},{"timestamp":"2014-06-19T14:25:23+02:00","requestId":"59b6d57c-c7a5-4b2a-af6d-fa10ae0de60c","sessionId":"75cd18db8a364c2","event":"TEST
>> Doge
>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Such
>> App!"}},{"timestamp":"2014-06-19T14:25:24+02:00","requestId":"8b05226f-bdf5-4af8-bb91-84da1b874c6e","sessionId":"75cd18db8a364c2","event":"TEST
>> Doge
>> Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,50"}},{"timestamp":"2014-06-19T14:25:24+02:00","requestId":"0f366675-5641-4238-b2a9-176735de6edd","sessionId":"75cd18db8a364c2","event":"TEST
>> Doge
>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff
>> ruff!"}},{"timestamp":"2014-06-19T14:25:26+02:00","requestId":"9e832534-5114-4ec1-bc20-1dcf1c354d0c","sessionId":"75cd18db8a364c2","event":"Session
>> end","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"start":"14:25:09","end":"14:25:26"}}],"writeKey":"a8RCFSAVjmT5qyxLKMzt12kcXWOIusvw","action":"import","received_at":"2014-06-19T12:25:29.790+00:00"}
>>
>>
>> The funny thing is that I'm sure I saw these values earlier today...I've
>> reloaded the data/tables several times to see if it's still working...well. :)
>>
>> I'm gonna stop for today...another try tomorrow.
>>
>> Thanks so far and many greetings from Berlin,
>> Chris
>>
>> On Mon, Jun 23, 2014 at 6:57 PM, Sachin Goyal <sgoyal@walmartlabs.com>
>> wrote:
>>
>>>
>>> You can also use hive-json-schema to automate Hive schema generation
>>> from JSON:
>>> https://github.com/quux00/hive-json-schema
>>>
>>>
>>> From: Nitin Pawar <nitinpawar432@gmail.com>
>>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>>> Date: Monday, June 23, 2014 at 2:25 AM
>>> To: "user@hive.apache.org" <user@hive.apache.org>
>>> Subject: Re: how to load json with nested array into hive?
>>>
>>> I think you can just take a look at jsonserde
>>>
>>> It does take care of nested JSON documents (though you will need to
>>> know the entire JSON structure upfront).
>>>
>>> Here is example of using it
>>> http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
>>>
>>>
>>>
>>>
>>> On Mon, Jun 23, 2014 at 2:28 PM, Christian Link <christian.link@mdmp.com> wrote:
>>> Hi Jerome,
>>>
>>> thanks...I've already found "Brickhouse" and the Hive UDFs, but it
>>> didn't help.
>>>
>>> Today I'll try again to process the json file after going through all my
>>> mails...maybe I'll find a solution.
>>>
>>> Best,
>>> Chris
>>>
>>>
>>> On Fri, Jun 20, 2014 at 7:16 PM, Jerome Banks <jbanks@tagged.com> wrote:
>>> Christian,
>>>    Sorry to spam this newsgroup, and this is not a commercial
>>> endorsement, but check out the Hive UDFs in the Brickhouse project (
>>> http://github.com/klout/brickhouse ) (
>>> http://brickhouseconfessions.wordpress.com/2014/02/07/hive-and-json-made-simple/
>>> )
>>>
>>> You can convert arbitrarily complex Hive structures to and from JSON with
>>> its to_json and from_json UDFs. See the blog post for an explanation.
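>>>
>>> For instance, a rough sketch (the jar path is illustrative and the UDF
>>> class names are written from memory, so check them against the repo):
>>>
>>> ADD JAR /path/to/brickhouse.jar;   -- illustrative path
>>> CREATE TEMPORARY FUNCTION to_json   AS 'brickhouse.udf.json.ToJsonUDF';
>>> CREATE TEMPORARY FUNCTION from_json AS 'brickhouse.udf.json.FromJsonUDF';
>>>
>>> -- JSON string -> typed Hive value (the type string is just an example)
>>> SELECT from_json('{"a":1,"b":["x","y"]}', 'struct<a:int,b:array<string>>');
>>> -- typed Hive value -> JSON string
>>> SELECT to_json(named_struct('a', 1, 'b', array('x','y')));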
>>>
>>> -- jerome
>>>
>>>
>>> On Fri, Jun 20, 2014 at 8:26 AM, Christian Link <christian.link@mdmp.com> wrote:
>>> hi,
>>>
>>> I'm very, very new to Hadoop, Hive, etc., and I have to import data into
>>> Hive tables.
>>>
>>> Environment: Amazon EMR, S3, etc.
>>>
>>> The input file is on S3 and I copied it into my HDFS.
>>>
>>> 1. I created a flat table with one column and loaded data into it:
>>>
>>>   CREATE TABLE mdmp_raw_data (json_record STRING);
>>>   LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO
>>> TABLE `mdmp_raw_data`;
>>> That worked; I can access some data, like this:
>>>
>>> SELECT d.carrier, d.language, d.country
>>>   FROM mdmp_raw_data a
>>>   LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'context') b
>>>     AS requestTimestamp, context
>>>   LATERAL VIEW json_tuple(b.context, 'locale') c AS locale
>>>   LATERAL VIEW json_tuple(c.locale, 'carrier', 'language', 'country') d
>>>     AS carrier, language, country
>>>   LIMIT 1;
>>>
>>> Result: o2 - de Deutsch Deutschland
>>>
>>> I can also select the array at once:
>>>
>>> SELECT b.requestTimestamp, b.batch
>>>   FROM mdmp_raw_data a
>>>   LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'batch') b
>>>     AS requestTimestamp, batch
>>>   LIMIT 1;
>>> This will give me:
>>>
>>>  [{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"2ca08247-5542-4cb4-be7e-4a8574fb77a8","sessionId":"f29ec175ca6b7d10","event":"TEST
>>> Doge
>>> Comments","userId":"doge96514016ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Much
>>> joy."}}, ...]
>>>
>>> This "batch" may contain n events with a structure like the one above.
>>>
>>> I want to put all events into a table where each element is stored in its
>>> own column: timestamp, requestId, sessionId, event, userId, action,
>>> context, properties.
>>>
>>> 2. Explode the "batch": I read a lot about SerDes, etc., but I don't get it.
>>>
>>> - I tried to create a table with an array and load the data into it:
>>> several errors.
>>> - I used explode in a query, but it doesn't accept "batch" as an array
>>> (one workaround is sketched after this list).
>>> - I integrated several SerDes but got things like "unknown function jspilt".
>>> - I'm lost in too many documents, howtos, etc. and could use some advice...
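>>>
>>> (One possible way to do this, sketched here on the assumption that the
>>> Brickhouse json_split UDF behaves as documented and that the jar path
>>> and class name below are correct: pull "batch" out as a string with
>>> json_tuple, split the JSON array into one string per event, explode
>>> that, and pick the fields out with get_json_object.)
>>>
>>> ADD JAR /home/hadoop/lib/hive/brickhouse.jar;   -- path is an assumption
>>> CREATE TEMPORARY FUNCTION json_split AS 'brickhouse.udf.json.JsonSplitUDF';
>>>
>>> SELECT get_json_object(ev, '$.event')     AS event,
>>>        get_json_object(ev, '$.userId')    AS user_id,
>>>        get_json_object(ev, '$.sessionId') AS session_id,
>>>        get_json_object(ev, '$.timestamp') AS ts
>>> FROM (
>>>   SELECT b.batch_json
>>>   FROM mdmp_raw_data a
>>>   LATERAL VIEW json_tuple(a.json_record, 'batch') b AS batch_json
>>> ) t
>>> LATERAL VIEW explode(json_split(t.batch_json)) e AS ev;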
>>>
>>> Thank you in advance!
>>>
>>> Best, Chris
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> ----------------------------------------------------------
> Good judgement comes with experience.
> Experience comes with bad judgement.
> ----------------------------------------------------------
> Roberto Congiu - Data Engineer - OpenX
> tel: +1 626 466 1141
>
