hadoop-hive-dev mailing list archives

From Steven Wong <sw...@netflix.com>
Subject RE: Deserializing map column via JDBC (HIVE-1378)
Date Wed, 01 Sep 2010 02:29:41 GMT
Upon further inspection, LazySimpleSerDe has the ability to serialize non-primitives into JSON,
but it lacks the reverse ability to deserialize that JSON back.

Here's my proposal:

1. By default, hive.fetch.output.format = display.
2. When JDBC driver connects to Hive server, execute "set hive.fetch.output.format = ctrl".
3. In Hive server:
  (a) If hive.fetch.output.format == display, FetchTask initializes LazySimpleSerDe as it
does today (field delimiter = tab, null sequence = "NULL", useJSONSerialize = true).
  (b) If hive.fetch.output.format == ctrl, FetchTask initializes LazySimpleSerDe to ctrl-delimit
everything. This is LazySimpleSerDe's default behavior anyway if it's initialized with the
schema (it isn't today).
4. JDBC driver deserializes with LazySimpleSerDe instead of DynamicSerDe.
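To make 3(b) concrete, the ctrl-delimited row format can be sketched outside of Hive. This is an illustrative sketch only, not Hive code; it assumes LazySimpleSerDe's default delimiters (\x01 between fields, \x02 between collection items, \x03 between a map key and its value):

```python
# Sketch of the ctrl-delimited row format that 3(b) would produce for a
# (map, bigint, string) row, assuming LazySimpleSerDe's default delimiters:
# \x01 (fields), \x02 (collection items), \x03 (map key/value).
FIELD, ITEM, KV = "\x01", "\x02", "\x03"

def serialize_row(mapcol, bigintcol, stringcol):
    """Serialize the row the way the server side would under 3(b)."""
    map_part = ITEM.join(k + KV + v for k, v in mapcol.items())
    return FIELD.join([map_part, str(bigintcol), stringcol])

def deserialize_row(row):
    """The reverse operation, as the JDBC client side would do it."""
    map_part, bigint_part, string_part = row.split(FIELD)
    entries = (e.split(KV, 1) for e in map_part.split(ITEM)) if map_part else []
    return dict(entries), int(bigint_part), string_part
```

The point is that the row round-trips without any JSON in the picture, which is exactly what the JDBC client needs.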

Your feedback?

My only remaining concern is that, for "select * from partitioned_table", 3(b) might require
fixing HIVE-1573 as well, because I hit a partition-column problem when I tried 3(b) in
the debugger. I hope HIVE-1573 can be fixed separately, but I don't know yet; I'll have to investigate.


-----Original Message-----
From: Steven Wong 
Sent: Friday, August 27, 2010 2:24 PM
To: hive-dev@hadoop.apache.org; 'John Sichi'
Cc: Zheng Shao; Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

A related jira is HIVE-1606 (for a null value in a string column, the JDBC driver returns the
string "NULL"). What happens is that the server-side serde already turns the null into "NULL". Both
null and "NULL" are serialized as "NULL", so the client-side serde has no way to tell them apart. I
bring this jira up to point out that JDBC's server side uses a serialization format that appears
intended for display (human consumption) instead of deserialization. The mixing of non-JSON and JSON
serializations is perhaps another manifestation of the same issue.
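The HIVE-1606 ambiguity is easy to demonstrate outside of Hive. A minimal sketch (not Hive code) of a display-oriented serializer with a "NULL" null sequence:

```python
# Sketch of the HIVE-1606 problem: a display-oriented serializer writes a
# SQL null as the string "NULL", so a real null and the literal string
# "NULL" become identical on the wire.
def display_serialize(value, null_sequence="NULL"):
    return null_sequence if value is None else value

# Both a real null and the string "NULL" produce the same bytes, so no
# client-side deserializer can recover which one was meant.
assert display_serialize(None) == display_serialize("NULL")
```

Once the two cases collapse to the same bytes on the server, no amount of client-side cleverness can undo it; that is why the fix has to be server-side.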

Also, fixing HIVE-1606 will obviously require a server-side change. Both HIVE-1606 and HIVE-1378
(the jira at hand) can share some of that server-side change, if HIVE-1378 ends up changing the
server side too.


-----Original Message-----
From: John Sichi [mailto:jsichi@facebook.com] 
Sent: Friday, August 27, 2010 11:29 AM
To: Steven Wong
Cc: Zheng Shao; hive-dev@hadoop.apache.org; Jerome Boulon
Subject: Re: Deserializing map column via JDBC (HIVE-1378)

I don't know enough about the serdes to say whether that's a problem...maybe someone else
does?  It seems like as long as the JSON form doesn't include the delimiter unescaped, it
might work?


On Aug 26, 2010, at 6:29 PM, Steven Wong wrote:

That sounds like it'll work, at least conceptually. But if the row contains primitive and
non-primitive columns, the row serialization will be a mix of non-JSON and JSON serializations,
right? Is that a good thing?

From: John Sichi [mailto:jsichi@facebook.com]
Sent: Thursday, August 26, 2010 12:11 PM
To: Steven Wong
Cc: Zheng Shao; hive-dev@hadoop.apache.org; Jerome Boulon
Subject: Re: Deserializing map column via JDBC (HIVE-1378)

If you replace DynamicSerDe with LazySimpleSerDe on the JDBC client side, can't you then tell
it to expect JSON serialization for the maps?  That way you can leave the FetchTask server
side as is.


On Aug 24, 2010, at 2:50 PM, Steven Wong wrote:

I got sidetracked for a while.

Looking at client.fetchOne, it is a call to the Hive server, which produces the following call stack:
SerDeUtils.getJSONString(Object, ObjectInspector) line: 205
LazySimpleSerDe.serialize(Object, ObjectInspector) line: 420
FetchTask.fetch(ArrayList<String>) line: 130
Driver.getResults(ArrayList<String>) line: 660
HiveServer$HiveServerHandler.fetchOne() line: 238

In other words, FetchTask.mSerde (an instance of LazySimpleSerDe) serializes the map column
into JSON strings. It's because FetchTask.mSerde has been initialized by FetchTask.initialize
to do it that way.

It appears that the fix is to initialize FetchTask.mSerde differently to do ctrl-serialization
instead - presumably for the JDBC use case only and not for other use cases of FetchTask.
Further, it appears that FetchTask.mSerde will do ctrl-serialization if it is initialized
(via the properties "columns" and "columns.types") with the proper schema.
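For the example query in this thread, the schema that would have to reach FetchTask.mSerde can be sketched like this. This is a plain-Python rendering of what the java.util.Properties contents would look like; the property names "columns" and "columns.types" are the ones mentioned above, and the colon-separated type list is an assumption about the expected encoding:

```python
# Sketch of the schema properties FetchTask.initialize would need to pass to
# the serde for "select mapcol, bigintcol, stringcol from foo", rendered as
# a plain dict (in Hive this would be a java.util.Properties). The
# colon-separated "columns.types" encoding is assumed for illustration.
serde_props = {
    "columns": "mapcol,bigintcol,stringcol",
    "columns.types": "map<string,string>:bigint:string",
}

def column_schema(props):
    """Pair each column name with its type string."""
    names = props["columns"].split(",")
    types = props["columns.types"].split(":")
    return list(zip(names, types))
```

With the schema present, the serde can fall back to its default ctrl-delimited behavior instead of the display-oriented JSON path.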

Are these right? Pointers on how to get the proper schema? (From FetchTask.work?) And on how
to restrict the change to JDBC only? (I have no idea.)

For symmetry, LazySimpleSerDe should be used to do ctrl-deserialization on the client side,
per Zheng's suggestion.


From: Zheng Shao [mailto:zshao@facebook.com]
Sent: Monday, August 16, 2010 3:57 PM
To: Steven Wong; hive-dev@hadoop.apache.org
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

I think the call to client.fetchOne should use delimited format, so that DynamicSerDe can
deserialize it.
This should be a good short-term fix.

Also on a higher level, DynamicSerDe is deprecated.  It will be great to use LazySimpleSerDe
to handle all serialization/deserializations instead.

From: Steven Wong [mailto:swong@netflix.com]
Sent: Friday, August 13, 2010 7:02 PM
To: Zheng Shao; hive-dev@hadoop.apache.org
Cc: Jerome Boulon
Subject: Deserializing map column via JDBC (HIVE-1378)

Trying to work on HIVE-1378. My first step is to get the Hive JDBC driver to return actual
values for mapcol in the result set of "select mapcol, bigintcol, stringcol from foo", where
mapcol is a map<string,string> column, instead of the current behavior of complaining
that mapcol's column type is not recognized.

I changed HiveResultSetMetaData.{getColumnType,getColumnTypeName} to recognize the map type,
but then the returned value for mapcol is always {}, even though mapcol does contain some
key-value entries. Turns out this is happening in HiveQueryResultSet.next:

1. The call to client.fetchOne returns the string "{"a":"b","x":"y"}   123         abc".
2. The serde (DynamicSerDe ds) deserializes the string to the list [{},123,"abc"].

The serde cannot correctly deserialize the map because apparently the map is not in the serde's
expected serialization format. The serde has been initialized with TCTLSeparatedProtocol.
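The failure mode can be sketched concretely. A deserializer that expects ctrl-separated map entries finds neither the item delimiter (\x02) nor the key/value delimiter (\x03) anywhere in the JSON text, so it keeps no entries and yields {}. This is an illustrative sketch, not DynamicSerDe's actual code:

```python
# Sketch of why a ctrl-expecting map deserializer turns JSON text into {}:
# it splits the field on \x02 (collection items) and \x03 (map key/value),
# finds neither delimiter in the JSON string, and so keeps no entries.
ITEM, KV = "\x02", "\x03"

def ctrl_deserialize_map(field):
    result = {}
    for entry in field.split(ITEM):
        if KV in entry:              # a well-formed key\x03value pair
            k, v = entry.split(KV, 1)
            result[k] = v            # never reached for JSON input
    return result
```

Given '{"a":"b","x":"y"}' this returns {}, matching the behavior in step 2 above, while a genuinely ctrl-separated field like "a\x03b\x02x\x03y" comes back as a full map.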

Should we make client.fetchOne return a ctrl-separated string? Or should we use a different
serde/format in HiveQueryResultSet? It seems the first way is right; correct me if that's
wrong. And how do we do that?

