From: shrikanth shankar
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Writing Custom Serdes for Hive
Date: Tue, 16 Oct 2012 09:09:26 -0700

I think what you need is a custom InputFormat/RecordReader. By the time the SerDe is called, the row has already been fetched. I believe the record reader can get access to predicates. The code to access HBase from Hive needs it for the same reasons you would with Mongo, and might be a good place to start.

thanks,
Shrikanth

On Oct 16, 2012, at 8:54 AM, John Omernik wrote:

> The reason I am asking (and maybe YC reads this list and can chime in) is that he has written a connector for MongoDB. It's simple: basically it connects to a MongoDB, maps columns (primitives only) to MongoDB fields, and allows you to select out of Mongo. Pretty sweet actually, and with Mongo, things are really fast for small tables.
>
> That being said, I noticed that his connector basically gets all rows from a MongoDB collection every time it's run. And we wanted to see if we could extend it to do some simple MongoDB-level filtering based on the passed query.
> Basically have a fail-open approach: if it saw something it thought it could optimize in the MongoDB query to limit data, it would; otherwise, it would default to the original approach of getting all the data.
>
> For example:
>
> select * from mongo_table where name rlike 'Bobby\\sWhite'
>
> Current method: the connection does db.collection.find(), which gets all the documents from MongoDB, and then Hive does the regex.
>
> The thing we want to try: "Oh, one of our defined Mongo columns has an rlike, OK, send this instead": db.collection.find({"name": /Bobby\sWhite/}), so less data would need to be transferred. Yes, Hive would still run the rlike on the data... "shrug" at least it's running it on far less data. Basically, if we could determine shortcuts, we could use them.
>
> Just trying to understand Serdes and how we are completely not using them as intended :)
>
> On Tue, Oct 16, 2012 at 10:42 AM, Connell, Chuck <Chuck.Connell@nuance.com> wrote:
>
> A serde is actually used the other way around… Hive parses the query, writes MapReduce code to solve the query, and the generated code uses the serde for field access.
>
> The standard way to write a serde is to start from the trunk regex serde, then modify as needed…
>
> http://svn.apache.org/viewvc/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java?revision=1131106&view=markup
>
> Also, nice article by Roberto Congiu…
>
> http://www.congiu.com/a-json-readwrite-serde-for-hive/
>
> Chuck Connell
> Nuance R&D Data Team
> Burlington, MA
>
> From: John Omernik [mailto:john@omernik.com]
> Sent: Tuesday, October 16, 2012 11:30 AM
> To: user@hive.apache.org
> Subject: Writing Custom Serdes for Hive
>
> We have a maybe obvious question about a serde. When a serde is invoked, does it have access to the original Hive query?
> Ideally the original query could provide the Serde some hints on how to access the data on the backend.
>
> Also, are there any good links/documentation on how to write Serdes? Kinda hard to google for some reason.

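For illustration only, the fail-open filtering idea from the thread could be sketched roughly as below. This is not the connector's code and not a Hive API: the function name, the (column, operator, value) tuple representation of the WHERE clause, and the column-to-field mapping are all hypothetical, and how the predicates would reach this code (for example via the record reader, as suggested above) is left out. The only MongoDB feature assumed is the `$regex` query operator.

```python
# Hypothetical sketch: none of these names come from Hive or the MongoDB connector.

# Operators we believe can be expressed as a MongoDB filter on a single field.
PUSHABLE_OPS = {
    "=": lambda value: value,                 # {"field": value}
    "rlike": lambda value: {"$regex": value}  # {"field": {"$regex": pattern}}
}


def build_mongo_filter(predicates, column_to_field):
    """Build a MongoDB filter document from the predicates we can push down.

    predicates: iterable of (column, operator, value) tuples, assumed to be
        ANDed together in the Hive WHERE clause.
    column_to_field: mapping from Hive column names to MongoDB field names.

    Anything not recognised is skipped, so the filter only ever gets looser.
    The worst case is {} (fetch every document, as the connector does today).
    """
    mongo_filter = {}
    for column, op, value in predicates:
        field = column_to_field.get(column)
        translate = PUSHABLE_OPS.get(op.lower())
        if field is None or translate is None or field in mongo_filter:
            continue  # fail open: leave this predicate for Hive to evaluate
        mongo_filter[field] = translate(value)
    return mongo_filter


if __name__ == "__main__":
    # The regex as Hive sees it after unescaping the 'Bobby\\sWhite' literal.
    print(build_mongo_filter([("name", "rlike", r"Bobby\sWhite")],
                             {"name": "name"}))
```

Because skipped predicates only widen the result set, Hive re-applying the full WHERE clause afterwards keeps the answer correct. One caveat with pushing down `rlike`: MongoDB uses Perl-compatible regular expressions while Hive uses Java's, so a real implementation would probably restrict pushdown to patterns that mean the same thing in both.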