Mailing-List: contact user-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hive.apache.org
Received-SPF: pass (athena.apache.org: domain of scoulibaly@gmail.com
 designates 74.125.82.43 as permitted sender)
MIME-Version: 1.0
From: =?UTF-8?Q?S=C3=A9kine_Coulibaly?= <scoulibaly@gmail.com>
Date: Mon, 4 Mar 2013 23:33:05 +0100
Message-ID: 
 <CAD8n-Fo3Gy5pD8XdJn=nt30WrXWEmS40sYcgx+h4AgJo+1dUpQ@mail.gmail.com>
Subject: Best table storage for analytical use case
To: user@hive.apache.org
Content-Type: multipart/alternative; boundary=047d7bf0d52e954b5704d720f23b

--047d7bf0d52e954b5704d720f23b
Content-Type: text/plain; charset=UTF-8

Hi there,

I've set up a virtual machine hosting Hive.
My use case is Web traffic analytics, so most of the queries are:

- how many requests today?
- how many requests today, grouped by country?
- most requested URLs?
- average HTTP server response time (in 5-minute slots)?

In other words, let's consider:
CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
and

SELECT COUNT(*) FROM logs;
SELECT COUNT(*),orig_country FROM logs GROUP BY orig_country;
SELECT COUNT(*),url FROM logs GROUP BY url;
SELECT AVG(http_rt) FROM logs ...

Two questions:
- How can I generate 5-minute slots to compute my averages? (In PostgreSQL,
I used generate_series() and a JOIN.) I would like to avoid running multiple
queries, each with a 'WHERE date > ... AND date < ...'. Maybe a mapper that
maps the date string to a slot number?
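
A minimal sketch of the slot-number idea in plain HiveQL, without a custom
mapper. It assumes the logs table also has a timestamp column, here called
ts (a hypothetical name, not part of the CREATE TABLE above), stored as a
STRING in 'yyyy-MM-dd HH:mm:ss' format, which is the default format that
unix_timestamp() parses:

```sql
-- Assumption: logs has an extra column ts STRING ('yyyy-MM-dd HH:mm:ss').
-- floor(seconds / 300) * 300 rounds each timestamp down to the start
-- of its 5-minute slot (300 seconds).
SELECT
  from_unixtime(floor(unix_timestamp(ts) / 300) * 300) AS slot_start,
  AVG(http_rt) AS avg_rt
FROM logs
GROUP BY from_unixtime(floor(unix_timestamp(ts) / 300) * 300);
```

Unlike the generate_series() and JOIN approach, this only returns slots
that contain at least one request; empty slots would have to be filled in
separately.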

- What is the best storage format for this table? Since its purpose is
analytical, I thought a columnar format was the way to go. So I tried RCFILE,
but the results for around 1 million rows (quite small, I know) are the
opposite of what I was expecting:

Storage / query duration / table size on disk
TEXTFILE / 22 seconds / 250 MB
RCFILE / 31 seconds / 320 MB

I thought storing values by column would speed up the aggregation.
Maybe the dataset is too small to tell, or did I miss something? Would adding
Snappy compression help (I'm not sure whether RCFiles are compressed or not)?
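
For reference, a sketch of how a Snappy-compressed RCFILE copy of the table
could be built for comparison. It assumes Hadoop 1.x style property names
and a cluster with the Snappy codec installed; logs_rc is a hypothetical
table name. As far as I understand, RCFile data is only compressed when
output compression is enabled in the session that writes it:

```sql
-- Assumption: Hadoop 1.x property names, Snappy codec available.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- logs_rc is a hypothetical name for the columnar copy of logs.
CREATE TABLE logs_rc ( url STRING, orig_country STRING, http_rt INT )
STORED AS RCFILE;

-- Rewrite the rows through Hive so they land in RCFile format.
INSERT OVERWRITE TABLE logs_rc
SELECT url, orig_country, http_rt FROM logs;
```

Running the same query against logs and logs_rc would then show whether
compression changes the timings and sizes above.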

Thank you!

--047d7bf0d52e954b5704d720f23b--