hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jon Palmer <>
Subject What's the right data storage/representation?
Date Tue, 15 May 2012 12:11:53 GMT

I'm a relative newcomer to Hadoop/Hive. We have a very standard setup of multiple webapp servers
backed by a mySql database. We are evaluating Hive as a high scale solution for our relatively
sophisticated reporting and analytics needs. However, it's not clear what the best practices
are around storing and representing the data our application generates. Probably best explained
with an example:

We imagine a Hive deployment that is importing Apache logs and MySql data from the application
db (probably via Sqoop). We would run our analysis daily and output the results somewhere
(flat files in s3 or another MySql reporting database). We have users that have a) a status
(Basic or Premium) and b) a location (a Zip code). We'd like to be able to ask questions like
"How many premium users did we have within ten miles of zip 02110 on Jan 3rd 2012?" Computing
these numbers for all dates across all zip codes and for a number of radi on a very large
set of users seems like a pretty good use of Hadoop/Hive.

However users can move location and change status. The application database only really cares
about the current location and status of a user and not the history of those fields. This
presents a challenge to the analytics process. If we run the analysis every day we will naturally
pick up the changes in status and location. However, if we were to try to recomputed our entire
analysis for all dates we would get different results for users that moved location or changed
status. The Apache logs are like not of much use as they are unlikely to contain member ids
to deduce the requests which resulted in the change of status or location for a user.

How is this type of problem typically solved with Hive?

I can see a few potential solutions:

1.       Don't solve it. Accept that you have some artifacts in your reporting data that cannot
be recovered from the source data.

2.       Create status and location history tables in the application db and use that during
the analytics process.

3.       Log the status and location change 'events' to some other log file and use those
logs in the Hive analysis.

Are there any 'best practices' around these kinds of problems and in particular suggestions
for the simplest implementation of the extra logging and analysis required by 3.?


This email is intended for the person(s) to whom it is addressed and may contain information
that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, distribution, copying, or disclosure
by any person other than the addressee(s) is strictly prohibited. If you have received this
email in error, please notify the sender immediately by return email and delete the message
and any attachments from your system.
View raw message