hive-user mailing list archives

From Christian Link <christian.l...@mdmp.com>
Subject Re: how to load json with nested array into hive?
Date Tue, 24 Jun 2014 11:54:34 GMT
ok - I'm sorting it out...I copied the "raw data" into a better formatted table
and can work with the data...

I'm able to explode "integrations" and all the other maps so that I can
handle the data.
Is there some kind of PIVOT to turn 10 rows / 2 columns into 2 rows / 10
columns (header = key)...

key_a val_a
key_b val_b

into

key_a key_b
val_a val_b

I'll look this up later...just an idea. ;)
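
Something like this conditional-aggregation trick is what I have in mind
(untested sketch against a hypothetical table kv with columns k and v; the
keys have to be known up front):

SELECT
  max(CASE WHEN k = 'key_a' THEN v END) AS key_a,
  max(CASE WHEN k = 'key_b' THEN v END) AS key_b
FROM kv;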

Thanks so far,
Chris


On Tue, Jun 24, 2014 at 12:44 PM, Christian Link <christian.link@mdmp.com>
wrote:

> ok - same results...so at least this seems to be stable ;)
>
> I'll try to find a clever way to get the values of "integrations".
>
>
> On Tue, Jun 24, 2014 at 12:37 PM, Christian Link <christian.link@mdmp.com>
> wrote:
>
>> Nope!
>> Still not working...new error
>>
>> Caused by: java.lang.ClassCastException:
>> org.apache.hadoop.io.DoubleWritable cannot be cast to
>> org.apache.hadoop.hive.serde2.io.DoubleWritable
>>
>> Caused by query:
>> SELECT DISTINCT
>>   a.context.device.brand as brand,
>>   a.context.device.manufacturer as manufacturer,
>>   a.context.device.model as model,
>>   a.context.device.release as release,
>>   a.context.device.sdk as sdk,
>>   *a.context.display.density as density*,
>>   a.context.display.height as height,
>>   a.context.display.width as width
>> FROM mdmp_raw_data a
>> LATERAL VIEW explode(a.batch) b AS batch
>> WHERE a.requestTimestamp = '2014-06-19T14:08:15+02:00'
>>   AND a.received_at = '2014-06-19T12:08:18.734+00:00';
>>
>> Table structure
>>
>> context  struct<
>>   build:map<string,string>,
>>   device:struct<brand:string,manufacturer:string,model:string,release:string,sdk:int>,
>>   display:struct<*density:double*,height:int,width:int>,
>>   integrations:map<string,boolean>,
>>   library:string,
>>   libraryversion:string,
>>   locale:map<string,string>,
>>   location:map<string,string>,
>>   telephony:map<string,string>,
>>   wifi:map<string,boolean>
>> >    from deserializer
>>
>> It's "density" which is causing the problem.
>> When I leave out density of the query, it's running fine.
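>>
>> A possible workaround I might try (just a guess, untested): declare density
>> as string in the table definition and cast it on read, e.g.
>>
>> --   display:struct<density:string,height:int,width:int>
>> SELECT cast(a.context.display.density AS double) AS density
>> FROM mdmp_raw_data a
>> LIMIT 1;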
>>
>>
>> Also, the values in "integrations" are NULL...they had been NOT NULL
>> yesterday...strange.
>>
>> table mdmp_raw_data (data imported using the 1.1.9.2 SerDe):
>>
>> SELECT DISTINCT
>>    a.context.integrations
>>  FROM mdmp_raw_data a
>> LATERAL VIEW explode(a.batch) b AS batch
>> WHERE a.requestTimestamp = '2014-06-19T14:08:15+02:00'
>>   AND a.received_at = '2014-06-19T12:08:18.734+00:00';
>>
>>
>> {"tapstream":false,"amplitude":false,"flurry":false,"quantcast":false,"countly":false,"mixpanel":false,"crittercism":false,"google
>> analytics":false,"bugsnag":false,"localytics":false}
>>
>> json file:
>> "integrations":{"Tapstream":false,"Amplitude":false,"Localytics":false,"Flurry":false,"Countly":false,"Bugsnag":false,"Quantcast":false,"Crittercism":false,"Google
>> Analytics":false,"Mixpanel":false}}
>>
>> So the keys had been changed to LOWER CASE...hmm, not sure if this makes
>> sense...and WHY it happened.
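>>
>> If the keys really are stored lowercased, looking them up with lowercase
>> literals should bring the values back (quick, untested check):
>>
>> SELECT
>>   a.context.integrations['tapstream']        AS int_tapstream,
>>   a.context.integrations['google analytics'] AS int_googleanalytics
>> FROM mdmp_raw_data a
>> LIMIT 1;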
>>
>> Anyway, I will import the data again and run the same queries...let's see
>> what changes next.
>>
>> THANKS a lot for all the support.
>>
>> Best,
>> Chris
>>
>>
>>
>>
>> On Tue, Jun 24, 2014 at 12:11 PM, Christian Link <christian.link@mdmp.com>
>> wrote:
>>
>>> I've got 1.1.9.2 (thanks to Roberto) and the data is looking better...
>>> I'll test the "double/int" thing now.
>>>
>>> Best,
>>> Chris
>>>
>>>
>>> On Mon, Jun 23, 2014 at 8:51 PM, Swagatika Tripathy <
>>> swagatikat856@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Use the 1.9.3 JSON SerDe with-dependencies jar. It's the latest one, I
>>>> suppose.
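>>>>
>>>> E.g. (jar path and exact file name just an assumption, adjust to your
>>>> setup):
>>>>
>>>> ADD JAR /path/to/json-serde-1.9.3-jar-with-dependencies.jar;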
>>>>
>>>> Thanks
>>>> Swagatika
>>>> On Jun 23, 2014 11:57 PM, "Roberto Congiu" <roberto.congiu@openx.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> 1.1.4 is an oldish version of the JSON SerDe; have you tried the most
>>>>> recent one from the master branch?
>>>>>
>>>>>
>>>>> On Mon, Jun 23, 2014 at 10:23 AM, Christian Link <
>>>>> christian.link@mdmp.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> thanks...but I need to sort things out with ONE SerDe/strategy...
>>>>>> I've started with André's idea by using Roberto Congiu's SerDe and
>>>>>> André's template to create a table with the right schema and loading the
>>>>>> data afterwards.
>>>>>>
>>>>>> But it's not completely working...
>>>>>>
>>>>>> I did the following (sorry for spamming...):
>>>>>>
>>>>>> 1. create table and load data
>>>>>>
>>>>>> -- create database (if not exists)
>>>>>> CREATE DATABASE IF NOT EXISTS mdmp_api_dump;
>>>>>>
>>>>>> -- connect to database;
>>>>>> USE mdmp_api_dump;
>>>>>>
>>>>>> -- add SerDE for json processing
>>>>>> ADD JAR
>>>>>> /home/hadoop/lib/hive/json-serde-1.1.4-jar-with-dependencies.jar;
>>>>>>
>>>>>> -- drop old raw data
>>>>>> DROP TABLE IF EXISTS mdmp_raw_data;
>>>>>>
>>>>>> -- create raw data table
>>>>>> CREATE TABLE mdmp_raw_data (
>>>>>>   action string,
>>>>>>   batch array<
>>>>>>           struct<
>>>>>>             timestamp:string,
>>>>>>             traits:map<string,string>,
>>>>>>             requestId:string,
>>>>>>             sessionId:string,
>>>>>>             event:string,
>>>>>>             userId:string,
>>>>>>             action:string,
>>>>>>             context:map<string,string>,
>>>>>>             properties:map<string,string>
>>>>>>
>>>>>>           >
>>>>>>         >,
>>>>>>   context struct<
>>>>>>             build:map<string,string>,
>>>>>>             device:struct<
>>>>>>                      brand:string,
>>>>>>                      manufacturer:string,
>>>>>>                      model:string,
>>>>>>                      release:string,
>>>>>>                      sdk:int
>>>>>>                    >,
>>>>>>             display:struct<
>>>>>>                       density:double,
>>>>>>                       height:int,
>>>>>>                       width:int
>>>>>>                     >,
>>>>>>             integrations:map<string,boolean>,
>>>>>>             library:string,
>>>>>>             libraryVersion:string,
>>>>>>             locale:map<string,string>,
>>>>>>             location:map<string,string>,
>>>>>>             telephony:map<string,string>,
>>>>>>             wifi:map<string,boolean>
>>>>>>           >,
>>>>>>   received_at string,
>>>>>>   requestTimestamp string,
>>>>>>   writeKey string
>>>>>> )
>>>>>> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
>>>>>> STORED AS TEXTFILE;
>>>>>>
>>>>>> -- load data
>>>>>> LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE INTO
>>>>>> TABLE `mdmp_raw_data`;
>>>>>>
>>>>>> 2. run query against the "raw data" and create "formatted table":
>>>>>>
>>>>>> ADD JAR
>>>>>> /home/hadoop/lib/hive/json-serde-1.1.4-jar-with-dependencies.jar;
>>>>>>
>>>>>> USE mdmp_api_dump;
>>>>>>
>>>>>> DROP TABLE IF EXISTS mdmp_api_data;
>>>>>>
>>>>>> CREATE TABLE mdmp_api_data AS
>>>>>> SELECT DISTINCT
>>>>>>   a.action,
>>>>>>   a.received_at,
>>>>>>   a.requestTimestamp,
>>>>>>   a.writeKey,
>>>>>>   a.context.device.brand as brand,
>>>>>>   a.context.device.manufacturer as manufacturer,
>>>>>>   a.context.device.model as model,
>>>>>>   a.context.device.release as release,
>>>>>>   a.context.device.sdk as sdk,
>>>>>> --  a.context.display.density as density,
>>>>>>   a.context.display.height as height,
>>>>>>   a.context.display.width as width,
>>>>>>   a.context.telephony['radio'] as tel_radio,
>>>>>>   a.context.telephony['carrier'] as tel_carrier,
>>>>>>   a.context.wifi['connected'] as wifi_connected,
>>>>>>   a.context.wifi['available'] as wifi_available,
>>>>>>    a.context.locale['carrier'] as loce_carrier,
>>>>>>   a.context.locale['language'] as loce_language,
>>>>>>   a.context.locale['country'] as loce_country,
>>>>>>   a.context.integrations['Tapstream'] as int_tapstream,
>>>>>>   a.context.integrations['Amplitude'] as int_amplitude,
>>>>>>   a.context.integrations['Localytics'] as int_localytics,
>>>>>>   a.context.integrations['Flurry'] as int_flurry,
>>>>>>   a.context.integrations['Countly'] as int_countly,
>>>>>>   a.context.integrations['Quantcast'] as int_quantcast,
>>>>>>   a.context.integrations['Crittercism'] as int_crittercism,
>>>>>>   a.context.integrations['Google Analytics'] as int_googleanalytics,
>>>>>>   a.context.integrations['Mixpanel'] as int_mixpanel,
>>>>>>   b.batch.action AS b_action,
>>>>>>   b.batch.context,
>>>>>>   b.batch.event,
>>>>>>   b.batch.properties,
>>>>>>   b.batch.requestId,
>>>>>>   b.batch.sessionId,
>>>>>>   b.batch.timestamp,
>>>>>>   b.batch.traits,
>>>>>>   b.batch.userId
>>>>>> FROM mdmp_raw_data a
>>>>>> LATERAL VIEW explode(a.batch) b AS batch;
>>>>>>
>>>>>> So far so good, I thought (besides a silly double/int bug in the
>>>>>> outdated SerDe).
>>>>>>
>>>>>> But it turned out, that some fields are NULL - within all records.
>>>>>>
>>>>>> Affected fields are:
>>>>>>   b.batch.event,
>>>>>>   b.batch.requestId,
>>>>>>   b.batch.sessionId,
>>>>>>   b.batch.userId
>>>>>>
>>>>>> I can see values in the json file, but neither  in the "raw table"
>>>>>> nor in the final table...that's really strange.
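>>>>>>
>>>>>> A minimal check against the raw table alone (untested sketch) to see
>>>>>> whether those struct fields are already NULL right after the LOAD:
>>>>>>
>>>>>> SELECT b.batch.event, b.batch.requestId, b.batch.sessionId, b.batch.userId
>>>>>> FROM mdmp_raw_data a
>>>>>> LATERAL VIEW explode(a.batch) b AS batch
>>>>>> LIMIT 5;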
>>>>>>
>>>>>> An example record:
>>>>>> {"requestTimestamp":"2014-06-19T14:25:26+02:00","context":{"libraryVersion":"0.6.13","telephony":{"radio":"gsm","carrier":"o2
>>>>>> -
>>>>>> de"},"wifi":{"connected":true,"available":true},"location":{},"locale":{"carrier":"o2
>>>>>> -
>>>>>> de","language":"Deutsch","country":"Deutschland"},"library":"analytics-android","device":{"brand":"htc","model":"HTC
>>>>>> One
>>>>>> S","sdk":16,"release":"4.1.1","manufacturer":"HTC"},"display":{"density":1.5,"width":540,"height":960},"build":{"name":"1.0","code":1},"integrations":{"Tapstream":false,"Amplitude":false,"Localytics":false,"Flurry":false,"Countly":false,"Bugsnag":false,"Quantcast":false,"Crittercism":false,"Google
>>>>>> Analytics":false,"Mixpanel":false}},"batch":[{"timestamp":"2014-06-19T14:25:17+02:00","requestId":"32377337-3f99-4ac5-bfc6-d3654584655b","sessionId":"75cd18db8a364c2","event":"TEST
>>>>>> Doge
>>>>>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff
>>>>>> ruff!"}},{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"fbfd45c9-cf0f-4cb3-955c-85c65220a5bd","sessionId":"75cd18db8a364c2","event":"TEST
>>>>>> Doge
>>>>>> Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,08"}},{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"3a643b12-64e5-4a7c-b44b-e3e09dbc5b66","sessionId":"75cd18db8a364c2","event":"TEST
>>>>>> Doge
>>>>>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},{"action":"identify","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"timestamp":"2014-06-19T14:25:19+02:00","traits":{"email":"
>>>>>> doges@mdmp.com","name":"Carmelo
>>>>>> Doge"},"requestId":"ef2910f4-cd4f-4175-89d0-7d91b35c229f","sessionId":"75cd18db8a364c2","userId":"doge74167705ruffruff"},{"timestamp":"2014-06-19T14:25:19+02:00","requestId":"1676bb06-abee-4135-a206-d57c4a1bc24d","sessionId":"75cd18db8a364c2","event":"TEST
>>>>>> Doge App
>>>>>> Usage","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{}},{"timestamp":"2014-06-19T14:25:20+02:00","requestId":"66532c8a-c5da-4852-b8b6-04df8f3052d5","sessionId":"75cd18db8a364c2","event":"TEST
>>>>>> Doge
>>>>>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Many
>>>>>> data."}},{"timestamp":"2014-06-19T14:25:21+02:00","requestId":"a1a79d8c-fe58-4567-8dec-a8d1d2ae2713","sessionId":"75cd18db8a364c2","event":"TEST
>>>>>> Doge
>>>>>> Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,87"}},{"timestamp":"2014-06-19T14:25:21+02:00","requestId":"259209ac-b135-4d5f-bdac-535eccc0400e","sessionId":"75cd18db8a364c2","event":"TEST
>>>>>> Doge
>>>>>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Wow..."}},{"timestamp":"2014-06-19T14:25:23+02:00","requestId":"59b6d57c-c7a5-4b2a-af6d-fa10ae0de60c","sessionId":"75cd18db8a364c2","event":"TEST
>>>>>> Doge
>>>>>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Such
>>>>>> App!"}},{"timestamp":"2014-06-19T14:25:24+02:00","requestId":"8b05226f-bdf5-4af8-bb91-84da1b874c6e","sessionId":"75cd18db8a364c2","event":"TEST
>>>>>> Doge
>>>>>> Purchase","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"revenue":"0,50"}},{"timestamp":"2014-06-19T14:25:24+02:00","requestId":"0f366675-5641-4238-b2a9-176735de6edd","sessionId":"75cd18db8a364c2","event":"TEST
>>>>>> Doge
>>>>>> Comments","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Ruff
>>>>>> ruff!"}},{"timestamp":"2014-06-19T14:25:26+02:00","requestId":"9e832534-5114-4ec1-bc20-1dcf1c354d0c","sessionId":"75cd18db8a364c2","event":"Session
>>>>>> end","userId":"doge74167705ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"start":"14:25:09","end":"14:25:26"}}],"writeKey":"a8RCFSAVjmT5qyxLKMzt12kcXWOIusvw","action":"import","received_at":"2014-06-19T12:25:29.790+00:00"}
>>>>>>
>>>>>>
>>>>>> Funny thing is, I'm sure that I've seen these values earlier
>>>>>> today...I've reloaded the data/tables several times to see if this is
>>>>>> still working...well. :)
>>>>>>
>>>>>> I'm gonna stop for today...another try tomorrow.
>>>>>>
>>>>>> Thanks so far and many greetings from Berlin,
>>>>>> Chris
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 23, 2014 at 6:57 PM, Sachin Goyal <sgoyal@walmartlabs.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> You can also use hive-json-schema to automate Hive schema generation
>>>>>>> from JSON:
>>>>>>> https://github.com/quux00/hive-json-schema
>>>>>>>
>>>>>>>
>>>>>>> From: Nitin Pawar <nitinpawar432@gmail.com>
>>>>>>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>>>>>>> Date: Monday, June 23, 2014 at 2:25 AM
>>>>>>> To: "user@hive.apache.org" <user@hive.apache.org>
>>>>>>> Subject: Re: how to load json with nested array into hive?
>>>>>>>
>>>>>>> I think you can just take a look at jsonserde
>>>>>>>
>>>>>>> It does take care of nested json documents (though you will need to
>>>>>>> know the entire json structure upfront).
>>>>>>>
>>>>>>> Here is example of using it
>>>>>>> http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 23, 2014 at 2:28 PM, Christian Link <christian.link@mdmp.com>
>>>>>>> wrote:
>>>>>>> Hi Jerome,
>>>>>>>
>>>>>>> thanks...I've already found "Brickhouse" and the Hive UDFs, but it
>>>>>>> didn't help.
>>>>>>>
>>>>>>> Today I'll try again to process the json file after going through
>>>>>>> all my mails...maybe I'll find a solution.
>>>>>>>
>>>>>>> Best,
>>>>>>> Chris
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 20, 2014 at 7:16 PM, Jerome Banks <jbanks@tagged.com>
>>>>>>> wrote:
>>>>>>> Christian,
>>>>>>>    Sorry to spam this newsgroup, and this is not a commercial
>>>>>>> endorsement, but check out the Hive UDFs in the Brickhouse project
>>>>>>> (http://github.com/klout/brickhouse) (
>>>>>>> http://brickhouseconfessions.wordpress.com/2014/02/07/hive-and-json-made-simple/
>>>>>>> )
>>>>>>>
>>>>>>> You can convert arbitrarily complex Hive structures to and from JSON
>>>>>>> with its to_json and from_json UDFs. See the blog posting for an
>>>>>>> explanation.
>>>>>>>
>>>>>>> -- jerome
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 20, 2014 at 8:26 AM, Christian Link <christian.link@mdmp.com>
>>>>>>> wrote:
>>>>>>> hi,
>>>>>>>
>>>>>>> I'm very, very new to Hadoop, Hive, etc., and I have to import data
>>>>>>> into Hive tables.
>>>>>>>
>>>>>>> Environment: Amazon EMR, S3, etc.
>>>>>>>
>>>>>>> The input file is on S3 and I copied it into my HDFS.
>>>>>>>
>>>>>>> 1. I created a flat table with one column and loaded data into it:
>>>>>>>
>>>>>>>   CREATE TABLE mdmp_raw_data (json_record STRING);
>>>>>>>   LOAD DATA INPATH 'hdfs:///input-api/1403181319.json' OVERWRITE
>>>>>>> INTO TABLE `mdmp_raw_data`;
>>>>>>> That worked, I can access some data, like this:
>>>>>>>
>>>>>>> SELECT d.carrier, d.language, d.country
>>>>>>>   FROM mdmp_raw_data a LATERAL VIEW json_tuple(a.json_record,
>>>>>>> 'requestTimestamp', 'context') b    AS requestTimestamp, context
>>>>>>>   LATERAL VIEW json_tuple(b.context, 'locale') c AS locale
>>>>>>>   LATERAL VIEW json_tuple(c.locale, 'carrier', 'language',
>>>>>>> 'country') d AS carrier, language, country
>>>>>>>   LIMIT 1;
>>>>>>>
>>>>>>> Result: o2 - de Deutsch Deutschland
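>>>>>>>
>>>>>>> For comparison, the same lookup should also be possible with
>>>>>>> get_json_object and a JSONPath-style expression (untested sketch,
>>>>>>> assuming the single-column table from step 1 with column json_record):
>>>>>>>
>>>>>>> SELECT
>>>>>>>   get_json_object(a.json_record, '$.context.locale.carrier')  AS carrier,
>>>>>>>   get_json_object(a.json_record, '$.context.locale.language') AS language,
>>>>>>>   get_json_object(a.json_record, '$.context.locale.country')  AS country
>>>>>>> FROM mdmp_raw_data a
>>>>>>> LIMIT 1;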
>>>>>>>
>>>>>>> I can also select the array at once:
>>>>>>>
>>>>>>> SELECT b.requestTimestamp, b.batch
>>>>>>>   FROM mdmp_raw_data a
>>>>>>>   LATERAL VIEW json_tuple(a.json_record, 'requestTimestamp', 'batch') b AS
>>>>>>> requestTimestamp, batch
>>>>>>>   LIMIT 1;
>>>>>>> This will give me:
>>>>>>>
>>>>>>>  [{"timestamp":"2014-06-19T14:25:18+02:00","requestId":"2ca08247-5542-4cb4-be7e-4a8574fb77a8","sessionId":"f29ec175ca6b7d10","event":"TEST
>>>>>>> Doge
>>>>>>> Comments","userId":"doge96514016ruffruff","action":"track","context":{"library":"analytics-android","libraryVersion":"0.6.13"},"properties":{"comment":"Much
>>>>>>> joy."}}, ...]
>>>>>>>
>>>>>>> This "batch" may contain n events will a structure like above.
>>>>>>>
>>>>>>> I want to put all events in a table where each "element" will be
>>>>>>> stored in a unique column: timestamp, requestId, sessionId, event,
>>>>>>> userId, action, context, properties
>>>>>>>
>>>>>>> 2. explode the "batch": I read a lot about SerDe, etc., but I don't
>>>>>>> get it.
>>>>>>>
>>>>>>> - I tried to create a table with an array and load the data into it -
>>>>>>> several errors
>>>>>>> - used explode in a query, but it doesn't accept "batch" as an array
>>>>>>> - integrated several SerDes but got things like "unknown function
>>>>>>> jspilt"
>>>>>>> - I'm lost in too many documents, howtos, etc. and could use some
>>>>>>> advice (rough sketch of what I'm after below)...
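>>>>>>>
>>>>>>> Roughly what I'm after, as a sketch (column list shortened, the SerDe
>>>>>>> class is only an assumption on my side, untested):
>>>>>>>
>>>>>>> CREATE TABLE mdmp_raw_data_nested (
>>>>>>>   batch array<struct<timestamp:string, event:string, userId:string>>
>>>>>>> )
>>>>>>> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
>>>>>>> STORED AS TEXTFILE;
>>>>>>>
>>>>>>> SELECT b.evt.timestamp, b.evt.event, b.evt.userId
>>>>>>> FROM mdmp_raw_data_nested a
>>>>>>> LATERAL VIEW explode(a.batch) b AS evt;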
>>>>>>>
>>>>>>> Thank you in advance!
>>>>>>>
>>>>>>> Best, Chris
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ----------------------------------------------------------
>>>>> Good judgement comes with experience.
>>>>> Experience comes with bad judgement.
>>>>> ----------------------------------------------------------
>>>>> Roberto Congiu - Data Engineer - OpenX
>>>>> tel: +1 626 466 1141
>>>>>
>>>>
>>>
>>
>
