Return-Path: X-Original-To: apmail-hive-user-archive@www.apache.org Delivered-To: apmail-hive-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id BB359F55E for ; Fri, 29 Mar 2013 18:37:37 +0000 (UTC) Received: (qmail 79644 invoked by uid 500); 29 Mar 2013 18:37:36 -0000 Delivered-To: apmail-hive-user-archive@hive.apache.org Received: (qmail 79571 invoked by uid 500); 29 Mar 2013 18:37:36 -0000 Mailing-List: contact user-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hive.apache.org Delivered-To: mailing list user@hive.apache.org Received: (qmail 79562 invoked by uid 99); 29 Mar 2013 18:37:36 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Mar 2013 18:37:36 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of dean.wampler@thinkbiganalytics.com designates 209.85.223.170 as permitted sender) Received: from [209.85.223.170] (HELO mail-ie0-f170.google.com) (209.85.223.170) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 29 Mar 2013 18:37:30 +0000 Received: by mail-ie0-f170.google.com with SMTP id c11so817051ieb.15 for ; Fri, 29 Mar 2013 11:37:08 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:x-received:x-originating-ip:in-reply-to:references :date:message-id:subject:from:to:content-type:x-gm-message-state; bh=sV1m4o0gCdaHWWwamI0vsmZ+1w+1qDclCqz2HuHEEz0=; b=l072u6lNd2mqwMFGHmAGkrdzkiEjHhT/d97XXBn6+O84IOnQsZk9LW0Q1kyjbg9sk6 FZ/9no88okzl6WWTB7YLAGx59IO4BAWEmSM0ThuBNfUS2tCnoJe0FIHQLrflRReFPLAP 4aHdMGsRwHFWv/FVRVJPPNIeKFeBBL1g1/bqwFXsXBqEOklR52QRnTR2mRm40A3NBWbj pt2GrVsNtBq3X5Sbl+U8DSJ+19jVrsTTn84OFpPPvtzNNQP8SZQfNt2ARhmh5plm9+/I 14Uic57i2TE6AIRz646SudaLABMcopJQ/vjX7bcSodZ09trDXcFrOrxhs/o+t1wF/xxf p0NQ== MIME-Version: 1.0 X-Received: by 10.50.46.197 with SMTP id x5mr364427igm.7.1364582228286; Fri, 29 Mar 2013 11:37:08 -0700 (PDT) Received: by 10.64.250.104 with HTTP; Fri, 29 Mar 2013 11:37:08 -0700 (PDT) X-Originating-IP: [99.140.182.24] In-Reply-To: References: Date: Fri, 29 Mar 2013 13:37:08 -0500 Message-ID: Subject: Re: Noob question on creating tables From: Dean Wampler To: user@hive.apache.org Content-Type: multipart/alternative; boundary=14dae9340be9fbf7d904d9148efe X-Gm-Message-State: ALoCoQkWAFb7fJlH29SJshxJbWwTIId9a9pqy6XfOr4jiqAWVVfTWPJ871rNNo7Rj5yUhJJYz9bT X-Virus-Checked: Checked by ClamAV on apache.org --14dae9340be9fbf7d904d9148efe Content-Type: text/plain; charset=ISO-8859-1 That's a drawback of external tables, but it's actually not as difficult as it sounds. It's easy to write a nightly "cron" job that creates the partition for the next day (or a job per month...), if someone on your team has some bash experience. Other job scheduling tools should support this too. Here's an example. First, a hive script that uses parameters for the date (Hive v0.8 or newer): -- addlogpartition.hql ALTER TABLE log ADD IF NOT EXISTS PARTITION (year = ${YEAR}, month = ${MONTH}, day = ${DAY}); Then, run this bash script AFTER MIDNIGHT: #!/bin/bash YEAR=$(date +%Y) # returns the string "2013" today. MONTH=$(date +%m) # returns the string "03" today, with the leading zero. DAY=$(date +%d) # returns the string "29" today. Will prefix with 0 for dates < 10. # Assumes /path/to/2013/03/29 is the correct directory name: /path/to/hive -f /path/to/addlogpartition.hql -d YEAR=$YEAR -d MON=$MONTH -d DAY=$DAY (Of course, all the /path/to will be different...) So, be careful of how how "03" vs. "3" is handled in both the ALTER TABLE statement and the path. Off hand, I don't know if Hive will complain if you use 03 as an integer value in the ALTER TABLE statement. On Fri, Mar 29, 2013 at 1:16 PM, Mark wrote: > Thanks > > Does this mean I need to create a partition for each day manually? There > is no way to have infer that from my directory structure? > > On Mar 29, 2013, at 10:40 AM, Sanjay Subramanian < > Sanjay.Subramanian@wizecommerce.com> wrote: > > > Hi > > > > CREATE EXTERNAL TABLE IF NOT EXISTS log_data(col1 datatype1, col2 > > datatype2, . . . colN datatypeN) PARTITIONED BY (YEAR INT, MONTH INT, DAY > > INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; > > > > > > ALTER table log_data ADD PARTITION (YEAR=2013 , MONTH=2, DAY=27) LOCATION > > '/path/to/YEAR/MONTH/DAY/directory/ON/HDFS';" > > > > Hive will read gzip and bz2 files out of the box.(so suppose you had > > hourly log files in gzip format in your /YEAR/MONTH/DAY directory then it > > will be read) > > Snappy and LZO will need some jar installs and configs > > https://github.com/toddlipcon/hadoop-lzo > > > > https://code.google.com/p/snappy/ > > > > > > Note that for example - gzip format is not splittable..so huge gzip files > > without splits are not recommended as input to maps > > > > Hope this helps > > > > sanjay > > > > > > On 3/29/13 10:19 AM, "Mark" wrote: > > > >> We have existing log data in directories in the format of > YEAR/MONTH/DAY. > >> > >> - How can we create a table over this table without hive modifying > and/or > >> moving it? > >> - How can we tell Hive to partition this data so it knows about each day > >> of logs? > >> - Does hive out of the box work with reading compressed files? > >> > >> Thanks > > > > > > CONFIDENTIALITY NOTICE > > ====================== > > This email message and any attachments are for the exclusive use of the > intended recipient(s) and may contain confidential and privileged > information. Any unauthorized review, use, disclosure or distribution is > prohibited. If you are not the intended recipient, please contact the > sender by reply email and destroy all copies of the original message along > with any attachments, from your computer system. If you are the intended > recipient, please be advised that the content of this message is subject to > access, review and disclosure by the sender's Email System Administrator. > > > > -- *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330 --14dae9340be9fbf7d904d9148efe Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable That's a drawback of external tables, but it's actually not as diff= icult as it sounds. It's easy to write a nightly "cron" job t= hat creates the partition for the next day (or a job per month...), if some= one on your team has some bash experience. Other job scheduling tools shoul= d support this too. Here's an example. First, a hive script that uses p= arameters for the date (Hive v0.8 or newer):

-- addlogpartition.hql
ALTER TABLE log ADD IF NOT = EXISTS PARTITION (year =3D ${YEAR}, month =3D ${MONTH}, day =3D ${DAY});

Then, run this bash script AFTER MIDNIGHT:

#!/bin/bash
YEAR=3D$(date +%Y) =A0 =A0 =A0 # returns t= he string "2013" today.
MONTH=3D$(date +%m) =A0 # retur= ns the string "03" today, with the leading zero.
DAY=3D= $(date +%d) =A0 =A0 =A0 =A0 =A0# returns the string "29" today. W= ill prefix with 0 for dates < 10.

# Assumes /path/to/2013/03/29 is the correct directory name:
= /path/to/hive -f /path/to/addlogpartition.hql -d YEAR=3D$YEAR -d MON=3D$MON= TH -d DAY=3D$DAY


(Of course, all th= e /path/to will be different...)

So, be careful of how how "03" vs. "3&qu= ot; is handled in both the ALTER TABLE=A0statement=A0and the path. Off hand= , I don't know if Hive will complain if you use 03 as an integer value = in the ALTER TABLE statement.


On Fri, Mar 29, 2013 at = 1:16 PM, Mark <static.void.dev@gmail.com> wrote:
=
Thanks

Does this mean I need to create a partition for each day manually? There is= no way to have infer that from my directory structure?

On Mar 29, 2013, at 10:40 AM,= Sanjay Subramanian <Sanjay.Subramanian@wizecommerce.com> wrote:

> Hi
>
> CREATE EXTERNAL TABLE IF NOT EXISTS log_data(col1 datatype1, col2
> datatype2, . . . colN datatypeN) PARTITIONED BY (YEAR INT, MONTH INT, = DAY
> INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
>
>
> ALTER table log_data ADD PARTITION (YEAR=3D2013 , MONTH=3D2, DAY=3D27)= LOCATION
> '/path/to/YEAR/MONTH/DAY/directory/ON/HDFS';"
>
> Hive will read gzip and bz2 files out of the box.(so suppose you had > hourly log files in gzip format in your /YEAR/MONTH/DAY directory then= it
> will be read)
> Snappy and LZO will need some jar installs and configs
> https://github.com/toddlipcon/hadoop-lzo
>
> https:= //code.google.com/p/snappy/
>
>
> Note that for example - gzip format is not splittable..so huge gzip fi= les
> without splits are not recommended as input to maps
>
> Hope this helps
>
> sanjay
>
>
> On 3/29/13 10:19 AM, "Mark" <static.void.dev@gmail.com> wrote:
>
>> We have existing log data in directories in the format of YEAR/MON= TH/DAY.
>>
>> - How can we create a table over this table without hive modifying= and/or
>> moving it?
>> - How can we tell Hive to partition this data so it knows about ea= ch day
>> of logs?
>> - Does hive out of the box work with reading compressed files?
>>
>> Thanks
>
>
> CONFIDENTIALITY NOTICE
> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> This email message and any attachments are for the exclusive use of th= e intended recipient(s) and may contain confidential and privileged informa= tion. Any unauthorized review, use, disclosure or distribution is prohibite= d. If you are not the intended recipient, please contact the sender by repl= y email and destroy all copies of the original message along with any attac= hments, from your computer system. If you are the intended recipient, pleas= e be advised that the content of this message is subject to access, review = and disclosure by the sender's Email System Administrator.
>




--
= Dean Wampler, Ph.D.
+1-312-339-1330

--14dae9340be9fbf7d904d9148efe--