hive-user mailing list archives

From shrikanth shankar <sshan...@qubole.com>
Subject Re: What's the right data storage/representation?
Date Tue, 15 May 2012 17:14:23 GMT
I would agree on keeping track of the history of updates in a separate table in Hive (you may
not need to maintain it in the application tier). This looks like the "Slowly Changing
Dimension" pattern used in other (more traditional) data warehouses. I suspect the challenge
here would be writing an ETL process to maintain the Hive table based on the current state
of the application db table.
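
To make the ETL step concrete, here is a rough sketch in plain Python of the merge logic such a job might apply on each run. It is only an illustration: the column names (user_id, valid_from, valid_to), the HistoryRow type and the apply_snapshot function are my own assumptions and not anything from the application or from Hive. Only "status" and "location" come from the thread. In practice the same logic would probably be expressed as a Hive query that rewrites the history table.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class HistoryRow:
    # Hypothetical layout of one row in the Hive history table.
    user_id: int
    status: str
    location: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"


def apply_snapshot(
    history: List[HistoryRow],
    snapshot: Dict[int, Tuple[str, str]],
    as_of: date,
) -> List[HistoryRow]:
    """Merge one snapshot of the application table into the history.

    snapshot maps user_id -> (status, location) as read on as_of.
    """
    result = []
    unchanged = set()
    for row in history:
        if row.valid_to is None and row.user_id in snapshot:
            if (row.status, row.location) == snapshot[row.user_id]:
                # The open row still matches the application db.
                unchanged.add(row.user_id)
            else:
                # The value changed: close the old version.
                row = replace(row, valid_to=as_of)
        result.append(row)
    # Open a new version for every changed or newly seen user.
    for user_id, (status, location) in sorted(snapshot.items()):
        if user_id not in unchanged:
            result.append(HistoryRow(user_id, status, location, as_of))
    return result


history = [
    HistoryRow(1, "active", "Boston", date(2012, 5, 1)),
    HistoryRow(2, "active", "Austin", date(2012, 5, 1)),
]
snapshot = {1: ("active", "Boston"), 2: ("closed", "Austin"), 3: ("active", "Denver")}
updated = apply_snapshot(history, snapshot, date(2012, 5, 15))
```

In this sketch user 1 keeps a single open row, user 2 gets the old row closed and a new one opened, and user 3 gets a first row. Rows for users that disappear from the snapshot are left open, and any change that happens and reverts between two runs is not captured, which is the same artifact problem Jon described.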

Shrikanth
On May 15, 2012, at 9:41 AM, Owen O'Malley wrote:

> On Tue, May 15, 2012 at 5:11 AM, Jon Palmer <jpalmer@care.com> wrote:
>> I can see a few potential solutions:
>> 
>> 1. Don’t solve it. Accept that you have some artifacts in your
>> reporting data that cannot be recovered from the source data.
>> 
>> 2. Create status and location history tables in the application db and
>> use that during the analytics process.
>> 
>> 3. Log the status and location change ‘events’ to some other log file
>> and use those logs in the Hive analysis.
> 
> I would probably create a Hive table that includes the status and
> location updates. One of the advantages of Hive & Hadoop is that it is
> easy to store the raw information in bulk and continue to process it.
> Once you have the information, you will likely find new uses for it.
> 
> -- Owen

