hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Praveen Kumar <pk1u...@gmail.com>
Subject Hive JSON transformation problem
Date Fri, 03 Jun 2011 04:21:32 GMT
I have a log table with a single column, where each row contains JSON
string in the following format; here are two log entries:

{
   'foo0': {   'bar0': 'A',
               'bar1': 'B'}
   'foo1': [   {   'params': {   'key0': 'valX', 'key1' : 'val1'},
                   'time': 'time0'},
               {   'params': {   'key1': 'val1', 'key2' : 'val2'},
                   'time': 'time1'},
               {   'params': {   'key3': 'val3', 'key0' : 'valY'},
                   'time': 'time2'},
               {   'params': {   'key4': 'val4', 'key2' : 'val2'},
                   'time': 'time3'},
               {   'params': {   'key0': 'valZ', 'key6' : 'val6'},
                   'time': 'time4'} ]
}

{
   'foo0': {   'bar0': 'C',
               'bar1': 'D'}
   'foo1': [   {   'params': {   'key0': 'valM', 'key1' : 'val1'},
                   'time': 'time5'},
               {   'params': {   'key1': 'val1', 'key2' : 'val2'},
                   'time': 'time6'},
               {   'params': {   'key3': 'val3', 'key0' : 'valN'},
                   'time': 'time7'},
               {   'params': {   'key4': 'val4', 'key2' : 'val2'},
                   'time': 'time8'},
               {   'params': {   'key0': 'valO', 'key6' : 'val6'},
                   'time': 'time9'} ]
}


I have a problem where I need to transform this and insert it into a
table with the following format:

hive> desc log2;
OK
bar0 string	
bar1 string
key0Val string
time

The results are related specifically to 'key0'. Eg, if the log table
contains the above JSON string in two rows, then after completing the
transformation, the results of querying the resulting table ( log2,
which has 4 columns ) needs to be:

select * from log2;
OK
A    B    valX    time0
A    B    valY    time2
A    B    valZ    time4
C    D    valM    time5
C    D    valN    time7
C    D    valO    time9


If I didn't need the 'time' column, I'm aware that I could get this
transformation result:

select * from log3;
OK
A    B    valX
A    B    valY
A    B    valZ
C    D    valM
C    D    valN
C    D    valO

via this query:

select get_json_object( log.entry, '$.foo0.bar0' ),
       get_json_object( log.entry, '$.foo0.bar1' ),
       t.key0Val,
from log
lateral view explode(split(regexp_replace(
    get_json_object( log.entry,
        '$.foo1.params.key0' ), '[\\[\\]"]', ''), ',')) t as key0Val;


But I need the 'log2' format with the 'time' column.

Note that I don't want to use row_sequence() to give the log entries
unique keys ( having one reducer task would be unacceptable ).

Thanks in advance,

pk

Mime
View raw message