hive-user mailing list archives

From Mark Grover <mgro...@oanda.com>
Subject Re: Hive for large statistics tables?
Date Tue, 27 Sep 2011 14:47:52 GMT
Hi Benjamin,
Wojciech raised some good points but I believe that Hive/Hadoop can still be useful in your
case.

The MySQL solution that you presently have is not scalable. Hive is not a substitute for MySQL;
it runs on Hadoop, which is a distributed batch-processing system. It will allow you to crunch
*a lot* of data, more than a stand-alone MySQL server would be able to
deal with.

Many people (including myself) use Hive/Hadoop in conjunction with a relational DB. They do
much of the number crunching via Hive/Hadoop and then write the aggregates to a (fast-access)
relational DB to provide quick access to those results. However, as Wojciech pointed out,
ad-hoc queries on Hive would, in general, take longer than similar queries in MySQL. Hive was
designed to deal with large amounts of data, so that's just an overhead we have to live with.
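
To make that workflow concrete, here is a minimal sketch in Python. It is not how
Hive itself works: a plain in-memory count stands in for the Hive/Hadoop batch job,
and SQLite stands in for the fast-access relational DB. The sample records, the
visit_stats table and all function names are made up for illustration.

```python
import sqlite3
from collections import Counter

# Hypothetical raw visit records: (country, visit_date, user_agent).
# In a real setup these would be log files processed by a Hive/Hadoop job.
RAW_VISITS = [
    ("US", "2011-09-27", "Firefox"),
    ("US", "2011-09-27", "Firefox"),
    ("CA", "2011-09-27", "Chrome"),
    ("US", "2011-09-26", "Chrome"),
]


def aggregate_hits(visits):
    """Batch step (stand-in for Hive): count hits per distinct record."""
    return Counter(visits)


def store_aggregates(conn, counts):
    """Load step: write the small aggregate table into the relational DB."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visit_stats ("
        "country TEXT, visit_date TEXT, user_agent TEXT, num_hits INTEGER, "
        "PRIMARY KEY (country, visit_date, user_agent))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO visit_stats VALUES (?, ?, ?, ?)",
        [(c, d, ua, n) for (c, d, ua), n in counts.items()],
    )
    conn.commit()


def hits_for_country(conn, country):
    """Serving step: a quick query against the pre-aggregated table."""
    row = conn.execute(
        "SELECT COALESCE(SUM(num_hits), 0) FROM visit_stats WHERE country = ?",
        (country,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")  # SQLite here; MySQL in the setup described
store_aggregates(conn, aggregate_hits(RAW_VISITS))
print(hits_for_country(conn, "US"))
```

The point of the split is that the expensive pass over raw data happens in batch,
while interactive queries only ever touch the much smaller aggregate table.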

I'd suggest doing some background research on how much data you have and whether Hive/Hadoop really
makes sense. Here is a good video from Alex Loddengaard to get you started. A good slide (at
15:00) compares Hadoop with an RDBMS. Later in the same video (at 37:30), there
is an example of a typical workflow with Hive and a relational DB.

Check it out and good luck!

Mark

----- Original Message -----
From: "Wojciech Langiewicz" <wlangiewicz@gmail.com>
To: user@hive.apache.org
Sent: Tuesday, September 27, 2011 9:33:53 AM
Subject: Re: Hive for large statistics tables?

Hello,
I'm using Hive to query data like yours. In my case I have about
300-500 GB of data per day, so it is much larger. We use Flume to load data into
Hive; data is rolled every day (this can be changed).

Hive queries, whether ad-hoc or scheduled, usually take at least 10-20s
(possibly hours), so Hive won't speed up your processing. Hive shows its
power when you have more data than several GB per month.

I think that in your case Hive is not a good solution; you'll be better
off using more powerful MySQL servers.

On 27.09.2011 11:14, Benjamin Fonze wrote:
> Dear All,
>
> I'm new to this list, and I hope I'm sending this to the right place.
>
> I'm currently using MySQL to store a large amount of visitor statistics.
> (Visits, clicks, etc....)
>
> Basically, each visit is logged in a text file, and every 15 minutes, a job
> consolidates it into MySQL, into tables that look like this:
>
> COUNTRY | DATE | USER_AGENT | REFERRER | SEARCH | ... | NUM_HITS
>
> This generates millions of rows a month, and several GB of data. Then, when
> querying these tables, it would typically take a few seconds. (Yes, there
> are indexes, etc...)
>
> I was thinking of moving all that data to a NoSQL DB like Hive, but I want to
> make sure it is suited to my purpose. Can you confirm that Hive is a good
> fit for such statistical data? More importantly, can you confirm that ad-hoc
> queries on that data will be much faster than MySQL?
>
> Thanks in advance!
>
> Benjamin.
>
