Date: Fri, 2 Sep 2011 13:18:58 -0400 (EDT)
From: Mark Grover
To: user@hive.apache.org
Cc: Baiju Devani, Bob Tiernay
Subject: Best practices for storing data on Hive

Hello folks,

I am fairly new to Hive and am wondering if you could share some best practices for storing and querying data with Hive.

Here is an example of the problem I am trying to solve. The traffic to our website is logged in files that contain information about clicks from various users. Simplified, a log file looks like:

  t_1, ip_1, userid_1
  t_2, ip_2, userid_2
  t_3, ip_3, userid_3
  ...

where t_i is the time of the click, ip_i is the IP address the click originated from, and userid_i is the user ID of the user.

Since clicks are logged on an ongoing basis, partitioning our Hive table by day seemed like the obvious choice: every night we upload the previous day's data into a new partition. However, we also want to be able to find all log lines for a particular user. With our present partitioning scheme, every day partition has to be scanned for that user ID, and this takes a long time.
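For concreteness, here is roughly what our current setup looks like in HiveQL (table, column, and path names are made up for this example):

  -- Day-partitioned clicks table
  CREATE TABLE clicks (
    click_time STRING,
    ip STRING,
    user_id STRING
  )
  PARTITIONED BY (dt STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

  -- Every night, load the previous day's log into a new partition
  LOAD DATA INPATH '/logs/clicks/2011-09-01.log'
  INTO TABLE clicks PARTITION (dt='2011-09-01');

  -- Looking up one user has to scan every day partition
  SELECT * FROM clicks WHERE user_id = 'u123';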
I am looking for ideas, suggestions, thoughts, and comments on how to reduce this time.

As one solution, I am thinking we could keep two independent tables: one partitioned by day and the other partitioned by userId. With the second table partitioned by userId, I will have to find some way of maintaining the partitions, since Hive doesn't support appending to existing files. This also seems suboptimal, since we are doubling the amount of data we store. (A rough sketch of what I mean is in the P.S. below.)

What do you folks think of this idea? Do you have any other suggestions on how we can approach this problem? What have other people in similar situations done? Please share.

Thank you in advance!

Mark
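P.S. Here is a rough sketch of the second-table idea, assuming dynamic partitioning is enabled (again, all names are invented for illustration):

  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  -- Same click data, but partitioned by user instead of by day
  CREATE TABLE clicks_by_user (
    click_time STRING,
    dt STRING,
    ip STRING
  )
  PARTITIONED BY (user_id STRING);

  -- Nightly maintenance: because Hive can't append to a partition's
  -- files, overwriting a user's partition with only the new day's rows
  -- would drop that user's older clicks. So this rebuilds everything
  -- from the day-partitioned table, which is exactly the data
  -- duplication I am worried about.
  INSERT OVERWRITE TABLE clicks_by_user PARTITION (user_id)
  SELECT click_time, dt, ip, user_id
  FROM clicks;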