hive-user mailing list archives

From Stephen Corona <>
Subject RE: Querying JSON/Thrift data?
Date Sat, 07 Mar 2009 15:52:39 GMT
Thanks for the reply. Would it be possible to add the TFileTransport -> SequenceFile process
to the Hive code base? If so, what kind of timeframe would that involve (i.e., is
there a lot of red tape to go through at Facebook)?

I think that CSV maps and lists can be a possible short term solution. Can they support adding
new keys to the map? Also, what is the behavior when a key doesn't exist in a particular map
for a record? Null? Or does Hive throw an error?

I saw the JSON function, but I think that delimited maps/lists are a better solution because
we don't need nested maps/lists.

Thanks again!

Steve Corona

From: Joydeep Sen Sarma []
Sent: Saturday, March 07, 2009 1:43 AM
Subject: RE: Querying JSON/Thrift data?

Yes - it makes complete sense. This is what we do here for some data sets.

Unfortunately the open source code base does not have the loaders we run to convert thrift
records in a tfiletransport into a sequencefile that hadoop/hive can work with. One option
is that we add this to Hive code base (should be straightforward).

Hive does support maps and lists encoded in delimited text files (please take a look at the
DDL syntax). If that's good enough for you - that may be a better option. However - this support
does not extend to any further nesting (structs/lists/maps inside lists/maps).
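
As an illustration of what "maps and lists encoded in delimited text files" can look like, here is a minimal Python sketch that parses one such row. It is not Hive code: the column layout (uid, name, a list of tags, a map of properties) and the three delimiters are assumptions made for this example. In Hive the delimiters would be declared in the table's DDL, and the actual parsing is done by Hive itself.

```python
# Hypothetical delimiters for this sketch; in Hive they are chosen in the DDL.
FIELD_DELIM = ","   # separates columns
ITEM_DELIM = "|"    # separates elements of a list or entries of a map
KEY_DELIM = ":"     # separates a map key from its value


def parse_row(line):
    """Split one text row into (uid, name, list of tags, map of properties)."""
    uid, name, tags, props = line.rstrip("\n").split(FIELD_DELIM)
    tag_list = tags.split(ITEM_DELIM) if tags else []
    prop_map = {}
    for item in (props.split(ITEM_DELIM) if props else []):
        key, _, value = item.partition(KEY_DELIM)
        prop_map[key] = value
    return int(uid), name, tag_list, prop_map


# Two rows with the same columns but different keys inside the map column.
row = parse_row("42,steve,php|scribe,browser:firefox|country:us")
other = parse_row("43,joy,,browser:safari")
```

Because the keys live inside the map column, each row can carry a different set of keys without any change to the column list. In this sketch a lookup such as `other[3].get("country")` gives `None`; that is a property of the Python stand-in and says nothing about how Hive itself treats a missing key.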

The third option is to provide a JSON Serde. We would like to do this - but haven't yet. There
is a JSON function available in Hive that can take a JSON-encoded column and evaluate expressions
over it. Using this may be another short-term workaround.
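
To make the idea of evaluating an expression over a JSON-encoded column concrete, here is a small Python sketch. It is a stand-in for the concept only, not Hive's implementation: the helper name `json_path`, the dotted path syntax, and the sample record are all assumptions made for this example. As far as I know, the Hive function referred to here is `get_json_object`, which has its own path syntax.

```python
import json


def json_path(doc, path):
    """Evaluate a dotted path such as 'user.name' against a JSON string.

    Returns None when the JSON is malformed or a key on the path is missing.
    """
    try:
        value = json.loads(doc)
    except ValueError:
        return None
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Hypothetical record, stored as a single string column.
record = '{"uid": 42, "user": {"name": "steve"}}'
name = json_path(record, "user.name")
age = json_path(record, "user.age")
```

The whole record stays in one string column and fields are pulled out at query time, so adding a field to the JSON requires no change to the table definition. The cost is that the string is parsed again on every evaluation.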

-----Original Message-----
From: Stephen Corona []
Sent: Friday, March 06, 2009 8:46 PM
Subject: RE: Querying JSON/Thrift data?

The input format can be whatever it needs to be to get it loaded into Hive.

I've been googling around all night and haven't really found what I am looking for. Basically,
I want to transfer some data from my web servers to Hive in a format that's a little more
verbose than plain CSV files. It seems like JSON or Thrift would be perfect for this. I am
planning on sending this serialized JSON or Thrift data through Scribe and loading it into
Hive. I just can't figure out how to tell Hive that the input data is a bunch of serialized
Thrift records (all of the records are the "struct" type) in a TFileTransport. Hopefully
this makes sense...


From: Joydeep Sen Sarma []
Sent: Friday, March 06, 2009 11:24 PM
Subject: RE: Querying JSON/Thrift data?

Can you describe the format of the input file in a bit more detail?

Is it a set of serialized Thrift records of the same class type? The current ThriftDeserializer
expects serialized records to be embedded inside a BytesWritable (we make sure of this during
the loading process) - but that is probably not the scenario for most people (we haven't gotten
around to fixing this yet).

-----Original Message-----
From: Stephen Corona []
Sent: Friday, March 06, 2009 8:05 PM
Subject: RE: Querying JSON/Thrift data?

I took a look at this class and tried to give it a shot. I'm not exactly sure what the create
table syntax should look like. I tried this:

hive> create table testing ( uid int, name string )
    > row format serializer 'org.apache.hadoop.hive.serde2.ThriftDeserializer'
    > ;
FAILED: Parse Error: line 2:7 mismatched input 'table' expecting TEMPORARY in create function

Steve Corona

From: Prasad Chakka []
Sent: Friday, March 06, 2009 7:33 PM
Subject: Re: Querying JSON/Thrift data?

Can you use ThriftDeserializer? Look at Complex class to see how it is used.


From: Stephen Corona <>
Date: Fri, 6 Mar 2009 16:02:02 -0800
Subject: RE: Querying JSON/Thrift data?

From: Stephen Corona
Sent: Friday, March 06, 2009 6:16 PM
Subject: Querying JSON/Thrift data?

Hey guys,

I am currently loading data into Hive in a CSV delimited format. This works but turns out
to be a huge pain when adding and removing columns (since they can only be added to the end
of the table). Is there any way to load and query data that's in some sort of JSON/Thrift
format? That way the data is already associated with some column and not just in a seemingly
arbitrary data format. I am pretty open on which format to use and how to load it into Hive.
FWIW, our data is generated in PHP and pushed to Scribe. Scribe aggregates the CSV files and
we load them into Hive every night.
