From: Brad Ruderman
Date: Thu, 31 Oct 2013 15:47:40 -0700
Subject: Re: External Partition Table
To: user@hive.apache.org, Raj Hadoop

Personally, from my limited understanding of your requirements, I would
think partitioning by day would be fine. Perhaps use the "YYYYMMDD" format,
so the partition for today would be 20131031 and tomorrow's would be
20131101.

Thanks,
Brad

On Thu, Oct 31, 2013 at 3:42 PM, Raj Hadoop wrote:

> Hi Brad,
>
> Thanks for the quick response.
>
> I have about one 10 GB file per day (web logs), and I am creating a
> folder (partition) for each day. Is that uncommon?
>
> I do not know at this juncture what kind of queries I will be executing
> on this table. I just wanted to know whether this is normal or not.
>
> Thanks,
> Raj
>
>
> On Thursday, October 31, 2013 6:39 PM, Brad Ruderman <
> bruderman@radiumone.com> wrote:
> Wow, that question isn't answerable as asked. It all depends on the
> amount of data per partition, the queries you are going to execute on it,
> and the structure of the data. In general in Hive (depending on your
> cluster size) you need to balance the number of files against their size.
> A smaller number of files is typically preferred, but partitions will
> help when restricting queries by date.
>
> Thx,
> Brad
>
>
> On Thu, Oct 31, 2013 at 3:34 PM, Raj Hadoop wrote:
>
> Hi,
>
> I am planning a Hive external partitioned table based on a date.
>
> Which of the options below yields better performance, or do both perform
> the same?
>
> 1) Partition based on one folder per day
> LIKE date INT
> 2) Partition based on one folder per year / month / day (so it has three
> folders)
> LIKE year INT, month INT, day INT
>
> Thanks,
> Raj
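
The "YYYYMMDD" suggestion at the top of this thread can be sketched roughly as follows. This is only an illustration, not something posted by either participant: the table name `web_logs`, the partition column `dt`, and the base path `/data/web_logs` are all hypothetical, and the generated statement assumes Hive's `ALTER TABLE ... ADD PARTITION ... LOCATION` form for attaching a daily folder to an external table.

```python
from datetime import date, timedelta

# Hypothetical names for illustration only; none of them come from the thread.
TABLE = "web_logs"
PARTITION_COLUMN = "dt"
BASE_PATH = "/data/web_logs"


def partition_value(day: date) -> int:
    """Return the day as a YYYYMMDD integer, e.g. 2013-10-31 -> 20131031."""
    return day.year * 10000 + day.month * 100 + day.day


def add_partition_statement(day: date) -> str:
    """Build the statement that registers one daily folder (option 1)."""
    value = partition_value(day)
    return (
        f"ALTER TABLE {TABLE} ADD IF NOT EXISTS "
        f"PARTITION ({PARTITION_COLUMN}={value}) "
        f"LOCATION '{BASE_PATH}/{value}'"
    )


def nested_partition_spec(day: date) -> str:
    """Build the three-column partition spec (option 2), for comparison."""
    return f"year={day.year}, month={day.month}, day={day.day}"


if __name__ == "__main__":
    today = date(2013, 10, 31)
    for d in (today, today + timedelta(days=1)):
        print(add_partition_statement(d))
        print("  nested alternative:", nested_partition_spec(d))
```

One possible reason to prefer the single integer column: a date range can be written as one predicate, such as `dt BETWEEN 20131001 AND 20131031`, whereas the year / month / day layout needs a condition on each column when a range crosses a month or year boundary.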