hive-user mailing list archives

From shrikanth shankar <sshan...@qubole.com>
Subject Re: Writing Custom Serdes for Hive
Date Tue, 16 Oct 2012 16:09:26 GMT
I think what you need is a custom InputFormat/RecordReader. By the time the SerDe is called,
the row has already been fetched. I believe the record reader can get access to predicates. The code
that accesses HBase from Hive needs this for the same reasons you would with Mongo, and might
be a good place to start. 
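In rough terms, the idea is to apply any pushed-down predicate while fetching rather than after. A minimal plain-Java sketch of that shape (FilteringReader and fetchAll are invented names for illustration, not Hadoop API; a real implementation would subclass Hadoop's RecordReader):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustration only: a stand-in for a record reader that applies a
// predicate while fetching rows, instead of handing every row to Hive.
class FilteringReader {
    private final List<String> source;      // stands in for the remote store
    private final Predicate<String> pushed; // predicate pushed down, if any

    FilteringReader(List<String> source, Predicate<String> pushed) {
        // fail open: with no usable predicate, pass every row through
        this.pushed = (pushed != null) ? pushed : row -> true;
        this.source = source;
    }

    List<String> fetchAll() {
        List<String> out = new ArrayList<>();
        for (String row : source) {
            if (pushed.test(row)) out.add(row); // filter at the source
        }
        return out;
    }
}
```

The fail-open default matters: when the reader cannot interpret the predicate, it degrades to the old behavior of returning everything, and Hive's own filtering still produces correct results.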

thanks,
Shrikanth
On Oct 16, 2012, at 8:54 AM, John Omernik wrote:

> The reason I am asking (and maybe YC reads this list and can chime in) is that he has written
a connector for MongoDB.  It's simple: basically it connects to a MongoDB, maps columns (primitives
only) to MongoDB fields, and allows you to select out of Mongo. Pretty sweet actually, and
with Mongo, things are really fast for small tables.  
> 
> 
> That being said, I noticed that his connector basically gets all rows from a MongoDB
collection every time it's run.  We wanted to see if we could extend it to do some simple
MongoDB-level filtering based on the passed query.  Basically, have a fail-open approach:
if it saw something it thought it could optimize in the MongoDB query to limit data, it would;
otherwise, it would default to the original approach of getting all the data.  
> 
> 
> For example:
> 
> select * from mongo_table where name rlike 'Bobby\\sWhite'
> 
> Current method: the connector does db.collection.find(), which gets all the documents from MongoDB,
and then Hive applies the regex.  
> 
> Thing we want to try: "Oh, one of our defined Mongo columns has an rlike; OK, send this instead:
db.collection.find({"name": /Bobby\sWhite/})".  Less data would need to be transferred. Yes,
Hive would still run the rlike on the data... *shrug*, at least it's running it on far less
data.  Basically, if we could determine shortcuts, we could use them. 
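That translation step can be sketched in plain Java without the Mongo driver (buildMongoFilter is a made-up helper name; it just renders the shell-style query string, and returns null to signal "fall back to a full scan"):

```java
import java.util.regex.Pattern;

// Sketch of the fail-open idea: if the WHERE clause is a simple RLIKE
// on a mapped column, build a Mongo regex filter for it; otherwise
// return null and fall back to fetching the whole collection.
class PredicatePushdown {
    static String buildMongoFilter(String column, String op, String pattern) {
        if (!"rlike".equalsIgnoreCase(op)) {
            return null; // can't optimize this operator: fetch everything
        }
        try {
            Pattern.compile(pattern); // sanity-check the regex first
        } catch (Exception e) {
            return null; // unparseable pattern: fail open
        }
        // The Mongo shell accepts JS-style regex literals in find():
        return "db.collection.find({\"" + column + "\": /" + pattern + "/})";
    }
}
```

Note this sketch glosses over regex-dialect differences: Java/Hive regexes and Mongo's (PCRE-style) regexes mostly overlap, but a real implementation would need to whitelist the constructs it forwards.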
> 
> 
> Just trying to understand Serdes and how we are completely not using them as intended
:) 
> 
> 
> 
> 
> On Tue, Oct 16, 2012 at 10:42 AM, Connell, Chuck <Chuck.Connell@nuance.com> wrote:
> A serde is actually used the other way around… Hive parses the query, writes MapReduce
code to solve the query, and the generated code uses the serde for field access.
> 
>  
> 
> The standard way to write a serde is to start from the trunk regex serde, then modify it as
needed…
> 
>  
> 
> http://svn.apache.org/viewvc/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java?revision=1131106&view=markup
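The core of what that regex serde's deserialize() does can be sketched in a few lines of plain Java (RegexRowParser is a made-up name for the sketch; the real class wraps this logic in Hive's SerDe interface, with initialize(), deserialize(), and an ObjectInspector):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: match one input line against a pattern and expose
// each capture group as a column value, the way a regex-based serde
// turns a fetched row into Hive columns.
class RegexRowParser {
    private final Pattern pattern;

    RegexRowParser(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    List<String> deserialize(String line) {
        Matcher m = pattern.matcher(line);
        List<String> row = new ArrayList<>();
        if (!m.matches()) return row; // non-matching line: no columns
        for (int i = 1; i <= m.groupCount(); i++) {
            row.add(m.group(i)); // each capture group becomes one column
        }
        return row;
    }
}
```

This also illustrates Chuck's point: the serde only sees a row that has already been fetched and turns bytes into columns; it is not the place where the query, or any predicate, is visible.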
> 
> 
> Also, nice article by Roberto Congiu…
> 
>  
> 
> http://www.congiu.com/a-json-readwrite-serde-for-hive/
> 
>  
> 
> Chuck Connell
> 
> Nuance R&D Data Team
> 
> Burlington, MA
> 
>  
> 
>  
> 
> From: John Omernik [mailto:john@omernik.com] 
> Sent: Tuesday, October 16, 2012 11:30 AM
> To: user@hive.apache.org
> Subject: Writing Custom Serdes for Hive
> 
>  
> 
> We have a maybe obvious question about a serde. When a serde is invoked, does it have
access to the original Hive query?  Ideally, the original query could give the serde some
hints on how to access the data on the backend.  
> 
>  
> 
> Also, are there any good links/documentation on how to write serdes?  Kinda hard to google
for some reason. 
> 
>  
> 
>  
> 
> 

