hive-user mailing list archives

From Sékine Coulibaly <>
Subject Best table storage for analytical use case
Date Mon, 04 Mar 2013 22:33:05 GMT
Hi there,

I've set up a virtual machine hosting Hive.
My use case is Web traffic analytics, so most queries are :

- how many requests today ?
- how many requests today, grouped by country ?
- most requested urls ?
- average http server response time (5 minutes slots) ?

In other words, let's consider :
CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )

SELECT COUNT(*),orig_country FROM logs GROUP BY orig_country;
SELECT AVG(http_rt) FROM logs ...

2 questions here :
- How can I generate 5-minute slots for my averages (in PostgreSQL, I
used generate_series() and a JOIN) ? I'd like to avoid running multiple
queries, each with a 'WHERE date > ... AND date < ...'. Maybe a mapper,
mapping the date string to a slot number ?
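
Something like this is what I have in mind (just a sketch, assuming the
table also has a log_date STRING column in 'yyyy-MM-dd HH:mm:ss' format,
which I left out of the simplified CREATE TABLE above) :

```sql
-- Sketch: bucket all rows into 5-minute slots in a single query.
-- unix_timestamp() converts the date string to epoch seconds;
-- integer division by 300 (= 5 minutes) gives the slot number.
SELECT floor(unix_timestamp(log_date) / 300) AS slot,
       AVG(http_rt) AS avg_rt
FROM logs
GROUP BY floor(unix_timestamp(log_date) / 300);
```
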

- What is the best storage format for this table ? Since its purpose is
analytical, I thought a columnar format was the way to go. So I tried RCFILE,
but the results for around 1 million rows (quite small, I
know) are as follows, and are quite the opposite of what I was expecting :

Storage / query duration / disk table size
TEXTFILE / 22 seconds / 250 MB
RCFILE / 31 seconds / 320 MB

I thought having the values stored in columns would speed up the aggregations.
Maybe the dataset is too small to tell, or did I miss something ? Would adding
Snappy compression help (I'm not sure whether RCFiles are compressed by default) ?
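
For what it's worth, here is how I would imagine enabling Snappy when
rewriting the table as RCFile (a sketch only, assuming the Snappy codec
is installed on the cluster; logs_rc is a name I made up) :

```sql
-- Enable output compression with Snappy for the INSERT below
-- (assumes org.apache.hadoop.io.compress.SnappyCodec is available).
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Hypothetical compressed copy of the logs table in RCFile format.
CREATE TABLE logs_rc (url STRING, orig_country STRING, http_rt INT)
STORED AS RCFILE;

INSERT OVERWRITE TABLE logs_rc
SELECT url, orig_country, http_rt FROM logs;
```
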

Thank you !
