Subject: Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause
From: Edward Capriolo <edlinuxguru@gmail.com>
To: hive-user@hadoop.apache.org
Date: Wed, 6 Oct 2010 22:56:05 -0400

On Wed, Oct 6, 2010 at 8:05 PM, Steven Wong wrote:
> What Hive version are you running? Try an "explain extended" on your insert
> query and see if unneeded partitions are included.
>
> Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT)
> is UTC-07:00. To convert UTC to PDT, the condition should be:
>
> (HF.dt = '2010-09-29' AND HF.hr >= '07') OR (HF.dt = '2010-09-30' AND HF.hr < '07')
>
> instead of:
>
> (HF.dt = '2010-09-29' AND HF.hr > '07') OR (HF.dt = '2010-09-30' AND HF.hr <= '07')
>
> Good luck on the days we spring forward or fall back.
>
> From: Marc Limotte [mailto:mslimotte@gmail.com]
> Sent: Wednesday, October 06, 2010 11:12 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause
>
> Thanks for the response, Edward.
>
> The source table (hourly_fact) is partitioned on dt (date) and hr (hour),
> and I've confirmed that they are both String fields (CREATE stmt is below).
>
> The hourly_fact table contains 'number of requests' for each hour by a few
> dimensions. The query is just trying to get a daily aggregation across
> those same dimensions. The only trick is that the hourly_fact table has dt
> and hour in UTC time, and the daily aggregation is being done for a PST
> (Pacific Standard) day, hence the 7-hour offset.
>
> CREATE TABLE IF NOT EXISTS hourly_fact (
>   tagtype               STRING,
>   country               STRING,
>   company               INT,
>   request_keyword       STRING,
>   receiver_code         STRING,
>   referrer_domain       STRING,
>   num_requests          INT,
>   num_new_user_requests INT
> )
> PARTITIONED BY (dt STRING, hr STRING)
> ROW FORMAT DELIMITED
> STORED AS SEQUENCEFILE
> LOCATION "...";
>
> Marc
>
> On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo wrote:
>
> On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte wrote:
>> Hi Namit,
>>
>> Hourly_fact is partitioned on dt and hr.
>>
>> Marc
>>
>> On Oct 3, 2010 10:00 PM, "Namit Jain" wrote:
>>> What is your table hourly_fact partitioned on?
>>>
>>> ________________________________________
>>> From: Marc Limotte [mslimotte@gmail.com]
>>> Sent: Friday, October 01, 2010 2:10 PM
>>> To: hive-user@hadoop.apache.org
>>> Subject: hive query doesn't seem to limit itself to partitions based on the WHERE clause
>>>
>>> Hi,
>>>
>>> From looking at the hive log output, it seems that my job is accessing
>>> many more partitions than it needs to. For example, my query is something
>>> like:
>>>
>>> INSERT OVERWRITE TABLE daily_fact
>>> PARTITION (dt='2010-09-29')
>>> SELECT
>>>   20100929 as stamp,
>>>   tagtype,
>>>   country,
>>>   sum(num_requests) AS num_requests
>>> FROM
>>>   hourly_fact HF
>>> WHERE
>>>   (HF.dt = '2010-09-29' AND HF.hr > '07')
>>>   OR (HF.dt = '2010-09-30' AND HF.hr <= '07')
>>> GROUP BY
>>>   20100929, tagtype, country
>>>
>>> Based on the WHERE clause, I would expect it to look only at partitions in
>>> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But the
>>> log contains entries like:
>>>
>>> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file
>>> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10
>>>
>>> And many other hours outside my WHERE constraint. I assume this means
>>> that it's processing those directories. The answer still comes out right,
>>> but I'm concerned about the performance.
>>>
>>> Would appreciate some help understanding what this means and how to fix
>>> it.
>>>
>>> Thanks,
>>> Marc
>
> Possibly you defined HF.hr <= '07' as an int column and comparing it
> as a string is resulting in a full table scan. Can you explain the
> query?

Since you defined '07' as a string, you are getting a lexicographic
comparison rather than a numeric one. That is why you are seeing more
partitions than you expect. = will work the same, but < and > will not;
a quick illustration follows.
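A minimal sketch of that difference (not from the original mail; it assumes a
Hive release new enough to accept a FROM-less SELECT, which older versions
are not):

SELECT '9' > '10';            -- true:  strings compare character by character
SELECT  9  >  10;             -- false: numbers compare by value
SELECT '07' < '10', 7 < 10;   -- true, true: zero-padded strings agree here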
You can try adding a cast in the query, or drop and re-add the partitions using a numeric type.
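A rough, untested sketch of those two options, reusing the table and query
from this thread (the hourly_fact_int name below is made up, and whether a
CAST on a partition column still allows pruning depends on the Hive version,
so re-check the plan with the "explain extended" Steven suggested):

-- Option 1: keep hr as STRING in the table, but compare it numerically in
-- the query (note the corrected >= / < boundaries for the UTC offset):
SELECT 20100929 as stamp, tagtype, country, sum(num_requests) AS num_requests
FROM hourly_fact HF
WHERE (HF.dt = '2010-09-29' AND CAST(HF.hr AS INT) >= 7)
   OR (HF.dt = '2010-09-30' AND CAST(HF.hr AS INT) <  7)
GROUP BY 20100929, tagtype, country;

-- Option 2: recreate the table with a numeric hour partition and re-add the
-- partitions, so the comparison is numeric from the start (hypothetical name):
CREATE TABLE IF NOT EXISTS hourly_fact_int (
  tagtype               STRING,
  country               STRING,
  company               INT,
  request_keyword       STRING,
  receiver_code         STRING,
  referrer_domain       STRING,
  num_requests          INT,
  num_new_user_requests INT
)
PARTITIONED BY (dt STRING, hr INT)
ROW FORMAT DELIMITED
STORED AS SEQUENCEFILE;

-- Either way, confirm that only the expected dt/hr directories show up as
-- inputs in the plan:
EXPLAIN EXTENDED
SELECT count(*)
FROM hourly_fact HF
WHERE (HF.dt = '2010-09-29' AND HF.hr >= '07')
   OR (HF.dt = '2010-09-30' AND HF.hr <  '07');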