From: Zheng Shao <zshao@facebook.com>
To: Steven Wong, "hive-dev@hadoop.apache.org", John Sichi
CC: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)
Date: Wed, 1 Sep 2010 10:21:45 +0000
Hi Steven,

Sorry for the late reply. The email slipped my eye...

This issue has been brought up multiple times. In my opinion, using JSON in LazySimpleSerDe (inherited from ColumnsetSerDe, MetadataColumnsetSerDe, DynamicSerDe) is a long-time legacy problem that never got fixed. LazySimpleSerDe was supposed to do delimited format only.

The cleanest way to do that is to:

1. Get rid of the JSON-related logic in LazySimpleSerDe;
2. Introduce another "DelimitedJSONSerDe" (without deserialization capability) that does JSON serialization for complex fields. (We have never needed deserialization for JSON yet.)
3. Configure the FetchTask to use the new SerDe by default, and LazySimpleSerDe in case it's JDBC. This is for serialization only. We might need to have 2 SerDe fields in FetchTask - one for deserializing the data from file, one for serializing the data to stdout/JDBC, etc.

I can help review the code (please ping me) if you decide to go down this route.

Zheng

-----Original Message-----
From: Steven Wong [mailto:swong@netflix.com]
Sent: Monday, August 30, 2010 3:46 PM
To: hive-dev@hadoop.apache.org; John Sichi
Cc: Zheng Shao; Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

Any guidance on how I/we should proceed on HIVE-1378 and HIVE-1606?

-----Original Message-----
From: Steven Wong
Sent: Friday, August 27, 2010 2:24 PM
To: hive-dev@hadoop.apache.org; 'John Sichi'
Cc: Zheng Shao; Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

A related jira is HIVE-1606 (For a null value in a string column, the JDBC driver returns the string "NULL").
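The collision behind HIVE-1606 can be sketched in a few lines. This toy serializer is illustrative only (the class and method names are not Hive's); it just mirrors the behavior where the server-side serde renders a SQL NULL as the literal text "NULL":

```java
// Toy model of the HIVE-1606 problem: a real null and the 4-character
// string "NULL" serialize to the same bytes, so the client-side serde
// cannot recover the difference. Names are illustrative, not Hive's API.
public class NullCollision {
    static String serializeField(Object value) {
        // Render a field the way the server-side output does
        return value == null ? "NULL" : value.toString();
    }

    public static void main(String[] args) {
        String fromNull = serializeField(null);       // a genuine null
        String fromLiteral = serializeField("NULL");  // the literal string
        // Identical on the wire -- deserialization is lossy.
        System.out.println(fromNull.equals(fromLiteral)); // prints "true"
    }
}
```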
What happens is the server-side serde already turns the null into "NULL". Both null and "NULL" are serialized as "NULL"; the client-side serde has no hope. I bring this jira up to point out that JDBC's server side uses a serialization format that appears intended for display (human consumption) instead of deserialization. The mixing of non-JSON and JSON serializations is perhaps another manifestation.

Also, fixing HIVE-1606 will obviously require a server-side change. Both HIVE-1606 and HIVE-1378 (the jira at hand) can share some server-side changes, if HIVE-1378 ends up changing the server side too.

Steven

-----Original Message-----
From: John Sichi [mailto:jsichi@facebook.com]
Sent: Friday, August 27, 2010 11:29 AM
To: Steven Wong
Cc: Zheng Shao; hive-dev@hadoop.apache.org; Jerome Boulon
Subject: Re: Deserializing map column via JDBC (HIVE-1378)

I don't know enough about the serdes to say whether that's a problem... maybe someone else does? It seems like as long as the JSON form doesn't include the delimiter unescaped, it might work?

JVS

On Aug 26, 2010, at 6:29 PM, Steven Wong wrote:

That sounds like it'll work, at least conceptually. But if the row contains primitive and non-primitive columns, the row serialization will be a mix of non-JSON and JSON serializations, right? Is that a good thing?

From: John Sichi [mailto:jsichi@facebook.com]
Sent: Thursday, August 26, 2010 12:11 PM
To: Steven Wong
Cc: Zheng Shao; hive-dev@hadoop.apache.org; Jerome Boulon
Subject: Re: Deserializing map column via JDBC (HIVE-1378)

If you replace DynamicSerDe with LazySimpleSerDe on the JDBC client side, can't you then tell it to expect JSON serialization for the maps? That way you can leave the FetchTask server side as is.

JVS

On Aug 24, 2010, at 2:50 PM, Steven Wong wrote:

I got sidetracked for a while....
Looking at client.fetchOne, it is a call to the Hive server, which shows the following call stack:

    SerDeUtils.getJSONString(Object, ObjectInspector) line: 205
    LazySimpleSerDe.serialize(Object, ObjectInspector) line: 420
    FetchTask.fetch(ArrayList) line: 130
    Driver.getResults(ArrayList) line: 660
    HiveServer$HiveServerHandler.fetchOne() line: 238

In other words, FetchTask.mSerde (an instance of LazySimpleSerDe) serializes the map column into JSON strings. It's because FetchTask.mSerde has been initialized by FetchTask.initialize to do it that way.

It appears that the fix is to initialize FetchTask.mSerde differently to do ctrl-serialization instead - presumably for the JDBC use case only and not for other use cases of FetchTask. Further, it appears that FetchTask.mSerde will do ctrl-serialization if it is initialized (via the properties "columns" and "columns.types") with the proper schema. Are these right?

Pointers on how to get the proper schema? (From FetchTask.work?) And on how to restrict the change to JDBC only? (I have no idea.)

For symmetry, LazySimpleSerDe should be used to do ctrl-deserialization on the client side, per Zheng's suggestion.

Steven

From: Zheng Shao [mailto:zshao@facebook.com]
Sent: Monday, August 16, 2010 3:57 PM
To: Steven Wong; hive-dev@hadoop.apache.org
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)

I think the call to client.fetchOne should use delimited format, so that DynamicSerDe can deserialize it. This should be a good short-term fix.

Also, on a higher level, DynamicSerDe is deprecated. It will be great to use LazySimpleSerDe to handle all serialization/deserialization instead.

Zheng

From: Steven Wong [mailto:swong@netflix.com]
Sent: Friday, August 13, 2010 7:02 PM
To: Zheng Shao; hive-dev@hadoop.apache.org
Cc: Jerome Boulon
Subject: Deserializing map column via JDBC (HIVE-1378)

Trying to work on HIVE-1378.
My first step is to get the Hive JDBC driver to return actual values for mapcol in the result set of "select mapcol, bigintcol, stringcol from foo", where mapcol is a map column, instead of the current behavior of complaining that mapcol's column type is not recognized.

I changed HiveResultSetMetaData.{getColumnType,getColumnTypeName} to recognize the map type, but then the returned value for mapcol is always {}, even though mapcol does contain some key-value entries. Turns out this is happening in HiveQueryResultSet.next:

1. The call to client.fetchOne returns the string "{"a":"b","x":"y"} 123 abc".
2. The serde (DynamicSerDe ds) deserializes the string to the list [{},123,"abc"].

The serde cannot correctly deserialize the map because apparently the map is not in the serde's expected serialization format. The serde has been initialized with TCTLSeparatedProtocol.

Should we make client.fetchOne return a ctrl-separated string? Or should we use a different serde/format in HiveQueryResultSet? It seems the first way is right; correct me if that's wrong. And how do we do that?

Thanks.
Steven
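For readers following the thread, the two wire formats under discussion can be sketched as follows. The delimiter bytes match LazySimpleSerDe's defaults (\001 between fields, \002 between collection items, \003 between a map key and its value); the class and helper names are illustrative, not Hive's actual code, and the JSON variant assumes whitespace-separated primitives as in the fetchOne output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the row (mapcol, bigintcol, stringcol) = ({a=b, x=y}, 123, "abc")
// in the two serializations discussed in the thread. Delimiter bytes follow
// LazySimpleSerDe's defaults; everything else here is illustrative.
public class RowFormats {
    // "ctrl-serialization": \001 between fields, \002 between map entries,
    // \003 between each key and its value.
    static String ctrlSeparated(Map<String, String> m, long big, String str) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Map.Entry<String, String> e : m.entrySet()) {
            if (!first) sb.append('\u0002');
            sb.append(e.getKey()).append('\u0003').append(e.getValue());
            first = false;
        }
        return sb.append('\u0001').append(big)
                 .append('\u0001').append(str).toString();
    }

    // The mixed format Steven observed: the map column rendered as JSON,
    // with the primitive columns appended in delimited (here: tab) form.
    static String jsonMap(Map<String, String> m, long big, String str) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : m.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
            first = false;
        }
        return sb.append("}\t").append(big).append('\t').append(str).toString();
    }

    public static void main(String[] args) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("a", "b");
        m.put("x", "y");
        System.out.println(jsonMap(m, 123L, "abc"));
        // A ctrl-separated client serde can split ctrlSeparated(...) cleanly,
        // but applied to jsonMap(...) it has no way to parse the map field --
        // which is why DynamicSerDe returned [{},123,"abc"].
    }
}
```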