Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 38D80DB6F for ; Mon, 4 Mar 2013 22:34:05 +0000 (UTC) Received: (qmail 71419 invoked by uid 500); 4 Mar 2013 22:34:03 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 71380 invoked by uid 500); 4 Mar 2013 22:34:03 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 71372 invoked by uid 99); 4 Mar 2013 22:34:03 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 04 Mar 2013 22:34:03 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of scoulibaly@gmail.com designates 74.125.82.43 as permitted sender) Received: from [74.125.82.43] (HELO mail-wg0-f43.google.com) (74.125.82.43) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 04 Mar 2013 22:33:58 +0000 Received: by mail-wg0-f43.google.com with SMTP id e12so4608771wge.34 for ; Mon, 04 Mar 2013 14:33:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:mime-version:from:date:message-id:subject:to :content-type; bh=aLPkqzItFSrHjRN7nofC+Ibljb9SUjoxP0MwZ+Nw8a8=; b=ZKHm3y0VJ15WiQD9IbY7vxDlDv3ozOdkw9AZ/FsVmc+dVSa13rB8QavRZvpLr6P5X4 PftqOAyc1zRGyLY3UUNay0dro9RZ20052bibOOGa61ORmbuf/4sZBStPGBf623rZ/xtL EYXVQt6KB8G0PEN7EwJ3kdNK4NBjM/QcAgovWkmBqhwiWZ+hr/rRUDwLfXANohyneK/S 4yHd6CzjAdUFjzkODZAXTokX7FrPPC2rTCQ4ivRPuA9V7tfoi8+dWbKw0/4JQ5nsw3Z1 D52JcCXRuM9br4jOIQCBVyrCGo31EQa+ta9eg2NYKxNLUU3F64TOHYc5GFbf5D3qabCy 7oUw== X-Received: by 10.194.77.129 with SMTP id s1mr35306280wjw.17.1362436415636; Mon, 04 Mar 2013 14:33:35 -0800 (PST) MIME-Version: 1.0 Received: by 10.216.97.68 with HTTP; Mon, 4 Mar 2013 14:33:05 -0800 (PST) From: =?UTF-8?Q?S=C3=A9kine_Coulibaly?= Date: Mon, 4 Mar 2013 23:33:05 +0100 Message-ID: Subject: Best table storage for analytical use case To: user@hive.apache.org Content-Type: multipart/alternative; boundary=047d7bf0d52e954b5704d720f23b X-Virus-Checked: Checked by ClamAV on apache.org --047d7bf0d52e954b5704d720f23b Content-Type: text/plain; charset=UTF-8 Hi there, I've setup a virtual machine hosting Hive. My use case is a Web traffic analytics, hence most of requests are : - how many requests today ? - how many request today, grouped by country ? - most requested urls ? - average http server response time (5 minutes slots) ? In other words, lets consider : CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT ) and SELECT COUNT(*) FROM logs; SELECT COUNT(*),orig_country FROM logs GROUP BY orig_country; SELECT COUNT(*),url FROM logs BROUP BY url; SELECT AVG(http_rt) FROM logs ... 2 questions here : - How to generate 5 minutes slots to make my averages (in Postgresql, I used to generate_series() and JOIN) ? I wish I could avoid doing multiple requests each with a 'WHERE date>... AND date <...'. Maybe a mapper, mapping the date string to a aslot number ? - What is the best storage method pour this table ? Since it's purpose is analytical, I thought columnar format was the way to go. So I tried RCFILE buy the results are as follow for around 1 million rows (quite small, I know) and are quite the opposite I was expecting : Storage / query duration / disk table size TEXTFILE / 22 seconds / 250MB RCFILE / 31 seconds / 320 MB I thought getting values in columns would speed up the aggregate process. Maybe the dataset is too small to tell, or I missed something ? Will adding Snappy compression help (not sure whether RCFiles are compressed or not) ? Thank you ! --047d7bf0d52e954b5704d720f23b Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hi there,

I've setup a virtua= l machine hosting Hive.
My use case is a Web traffic analyt= ics, hence most of requests are :

- ho= w many requests today ?
- how many request today, grouped by country ?
- = most requested urls ?
- average http server respo= nse time (5 minutes slots) ?

In other words, lets consider :
CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT= )
and=C2=A0

SELECT CO= UNT(*) FROM logs;
SELECT COUNT(*),orig_country FROM logs GR= OUP BY orig_country;
SELECT COUNT(*),url FROM logs BROUP BY url;
SELE= CT AVG(http_rt) FROM logs ...

2 questi= ons here :
- How to generate 5 minutes slots to make my ave= rages (in Postgresql, I used to generate_series() and JOIN) ? I wish I coul= d avoid doing multiple requests each with a 'WHERE date>... AND date= <...'. Maybe a mapper, mapping the date string to a aslot number ?<= /div>

- What is the best storage method pour this= table ? Since it's purpose is analytical, I thought columnar format wa= s the way to go. So I tried RCFILE buy the results are as follow for around= 1 million rows (quite small, I know) and are quite the opposite I was expe= cting :

Storage / query duration / disk table size<= /div>
TEXTFILE / 22 seconds / 250MB
RCFILE / 31 s= econds / 320 MB

=C2=A0I thought gettin= g values in columns would speed up the aggregate process. Maybe the dataset= is too small to tell, or I missed something ? Will adding Snappy compressi= on help (not sure whether RCFiles are compressed or not) ?

Thank you !



--047d7bf0d52e954b5704d720f23b--