hive-user mailing list archives

From Mike Liddell <Mike.Lidd...@microsoft.com>
Subject RE: Should the SQL schema for HIVE be identical to the SQL schema for a relational database?
Date Tue, 09 Apr 2013 04:56:09 GMT
Some general guidance would be to aim for minimal JOINs in your regular queries.
Thus the biggest change from a normal-form RDBMS schema is to denormalize so that joins
do not come into play until a query has already performed data reduction via filtering or
aggregation. This implies a star schema comprising one primary fact table with enough
data to partition it sensibly and to support direct filtering and aggregation.
Supplementing the main table will be dimension tables that provide additional data to
flesh out result sets.
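As an illustrative sketch only (all table and column names here are hypothetical, chosen to match the event data described below), the layout might look like one wide, partitioned fact table plus a small dimension table, with filtering done before the join:

```sql
-- Hypothetical sketch: a denormalized, partitioned fact table for the
-- event stream, plus a dimension table holding extra per-user detail.
CREATE TABLE fact_events (
  ts        BIGINT,
  user_id   STRING,
  event     STRING,
  product   STRING,
  latitude  DOUBLE,
  longitude DOUBLE
)
PARTITIONED BY (dt STRING)   -- e.g. '2013-04-08'; enables partition pruning
STORED AS SEQUENCEFILE;

CREATE TABLE dim_user (
  user_id     STRING,
  country     STRING,
  signup_date STRING
);

-- Reduce the fact table first (partition filter + predicate), then join
-- the small result to the dimension table only to flesh out the output.
SELECT u.country, COUNT(*) AS add_product_count
FROM (
  SELECT user_id
  FROM fact_events
  WHERE dt = '2013-04-08' AND event = 'addProduct'
) e
JOIN dim_user u ON (e.user_id = u.user_id)
GROUP BY u.country;
```

The point of the subquery is that the join operates on an already-reduced row set rather than the full fact table.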

It looks like your data will be highly amenable to this.

-mike.

From: Matthieu Labour [mailto:matthieu@actionx.com]
Sent: Monday, April 08, 2013 3:50 PM
To: user@hive.apache.org
Subject: Should the SQL schema for HIVE be identical to the SQL schema for a relational database?

Hi

It would be terrific to get some advice on migrating a schema from an RDBMS to Hive.

Should the SQL schema for Hive be identical to the SQL schema for a PostgreSQL/MySQL database?

Specifically:

I have an application that generates events that look like the following:

{"ts":N+1,"userId":"123","event":"location","payload":{"verticalAccuracy":10,"longitude":-73.99718090313884,"latitude":40.72473278788106,"altitude":27.79653739929199,"horizontalAccuracy":65}}
{"ts":N+2,"userId":"123","event":"addProduct","payload":["cart","osprey-kestrel-48"]}
...

Events are being written to persistent storage (AWS S3). A 'worker' wakes up periodically,
reads the newly received events, and inserts them into a PostgreSQL database. The database
has user, product, user_product (user_id, product_id, action:(viewed|wishlist...), timestamp),
location, etc. tables.

We are migrating to Hive.

Should we also create user, product, user_product, location, etc. as Hive tables and have
a MapReduce job process the event files to populate them? Or should/can we implement
a different schema that would allow external Hive tables to map directly onto the
generated event files? Or a mix of both?
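For the external-table option, a rough sketch of the idea (the SerDe class, bucket path, and column subset below are assumptions for illustration, not a tested setup; the nested/heterogeneous payload field would need further handling) would point a table directly at the raw event files on S3:

```sql
-- Hypothetical sketch: an external table mapped onto the raw JSON event
-- files in S3, using a JSON SerDe so no copy/load step is needed.
CREATE EXTERNAL TABLE raw_events (
  ts     BIGINT,
  userId STRING,
  event  STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3n://my-bucket/events/';   -- hypothetical bucket/path
```

A common pattern is a mix: an external table like this over the raw files, with a periodic job that inserts from it into a partitioned, denormalized managed table for fast querying.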

Thank you for your help!

-matt
