hive-user mailing list archives

From Tim Robertson <>
Subject Re: Case Studies for 'Programming Hive' book from O'Reilly
Date Sun, 15 Apr 2012 11:06:00 GMT
Hi Jason,

I work for an international organization involved in the mobilization of
biodiversity data (specifically, we deal a lot with observations of
species), so think of it as a lot of point-based information with metadata
tags.  We have built an Oozie workflow that uses Sqoop to pull in a few
databases and then runs a big transformation and a set of quality control
steps, which we wrote using Hive and some custom UDFs.  There is a blog introducing
this on

All our work and data are open, so I can freely write about any of it, and
can link to real production code in Google svn.

If it would be of interest to you, I am happy to discuss what would be most
useful to write up for your book.  Some possible angles:
- real UDFs in action (e.g. parsing species scientific names)
- UDTFs to generate a Google map tile cache
- Hive in an ETL workflow to remove load from DBs
- The pros and cons of calling web services from a UDF (we do it because it
keeps concerns clean, and we accept the risk of a DDoS since we can control it)
- Sqoop and Hive together
- We are getting into Hive on HBase and have found UDFs can help with type
safety since we aren't running HIVE-1634
  [with the advancements in Hive 0.9 I would think our workarounds are not
worth documenting]
- Metrics illustrating the importance of join order, and knowing data
cardinality to ensure decent performance.
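
For illustration only, here is a minimal Python sketch of the kind of row
generation a tile-cache UDTF performs: one input point fans out into one
output row per zoom level.  It assumes the standard Web Mercator tiling
scheme used by Google-style map tiles.  The function names are hypothetical,
and this is not the production code mentioned above (Hive UDTFs themselves
are written in Java).

```python
import math


def point_to_tile(lat_deg, lon_deg, zoom):
    """Return the (x, y) Web Mercator tile containing a point at a zoom level."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that lon = 180 and near-polar latitudes stay inside the grid.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return x, y


def explode_to_tiles(lat_deg, lon_deg, max_zoom):
    """Yield one (zoom, x, y) row per zoom level, the way a UDTF emits rows."""
    for zoom in range(max_zoom + 1):
        x, y = point_to_tile(lat_deg, lon_deg, zoom)
        yield zoom, x, y
```

Grouping the emitted rows by (zoom, x, y) then gives the set of points that
fall in each tile, which is the input a tile renderer would need.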

Hope this is of interest,

On Wed, Apr 11, 2012 at 7:48 PM, Jason Rutherglen <> wrote:

> Dear Hive User,
> We want your interesting case study for our upcoming book titled
> 'Programming Hive' from O'Reilly.
> How you use Hive, either high level or low level code details are both
> encouraged!
> Feel free to reach out with a brief abstract.
> Regards,
> Jason Rutherglen
