Subject: Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause
From: Edward Capriolo <edlinuxguru@gmail.com>
To: hive-user@hadoop.apache.org
Date: Wed, 6 Oct 2010 22:56:05 -0400

On Wed, Oct 6, 2010 at 8:05 PM, Steven Wong wrote:
> What Hive version are you running? Try an "explain extended" on your insert
> query and see if unneeded partitions are included.
>
> Pacific Standard Time (PST) is UTC-08:00, while Pacific Daylight Time (PDT)
> is UTC-07:00. To convert UTC to PDT, the condition should be:
>
> (HF.dt = '2010-09-29' AND HF.hr >= '07') OR (HF.dt = '2010-09-30' AND HF.hr < '07')
>
> instead of:
>
> (HF.dt = '2010-09-29' AND HF.hr > '07') OR (HF.dt = '2010-09-30' AND HF.hr <= '07')
>
> Good luck on the days we spring forward or fall back.
>
> From: Marc Limotte [mailto:mslimotte@gmail.com]
> Sent: Wednesday, October 06, 2010 11:12 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause
>
> Thanks for the response, Edward.
>
> The source table (hourly_fact) is partitioned on dt (date) and hr (hour),
> and I've confirmed that they are both String fields (CREATE stmt is below).
>
> The hourly_fact table contains 'number of requests' for each hour by a few
> dimensions. The query is just trying to get a daily aggregation across
> those same dimensions. The only trick is that the hourly_fact table has dt
> and hour in UTC time, and the daily aggregation is being done for a PST
> (Pacific Standard) day, hence the 7-hour offset.
>
> CREATE TABLE IF NOT EXISTS hourly_fact (
>   tagtype               STRING,
>   country               STRING,
>   company               INT,
>   request_keyword       STRING,
>   receiver_code         STRING,
>   referrer_domain       STRING,
>   num_requests          INT,
>   num_new_user_requests INT
> )
> PARTITIONED BY (dt STRING, hr STRING)
> ROW FORMAT DELIMITED
> STORED AS SEQUENCEFILE
> LOCATION "...";
>
> Marc
>
> On Tue, Oct 5, 2010 at 4:30 PM, Edward Capriolo wrote:
>
> On Tue, Oct 5, 2010 at 3:36 PM, Marc Limotte wrote:
>> Hi Namit,
>>
>> Hourly_fact is partitioned on dt and hr.
>>
>> Marc
>>
>> On Oct 3, 2010 10:00 PM, "Namit Jain" wrote:
>>> What is your table hourly_fact partitioned on?
>>>
>>> ________________________________________
>>> From: Marc Limotte [mslimotte@gmail.com]
>>> Sent: Friday, October 01, 2010 2:10 PM
>>> To: hive-user@hadoop.apache.org
>>> Subject: hive query doesn't seem to limit itself to partitions based on the WHERE clause
>>>
>>> Hi,
>>>
>>> From looking at the hive log output, it seems that my job is accessing
>>> many more partitions than it needs to. For example, my query is something
>>> like:
>>>
>>> INSERT OVERWRITE TABLE daily_fact
>>> PARTITION (dt='2010-09-29')
>>> SELECT
>>>   20100929 as stamp,
>>>   tagtype,
>>>   country,
>>>   sum(num_requests) AS num_requests
>>> FROM
>>>   hourly_fact HF
>>> WHERE
>>>   (HF.dt = '2010-09-29' AND HF.hr > '07')
>>>   OR (HF.dt = '2010-09-30' AND HF.hr <= '07')
>>> GROUP BY
>>>   20100929, tagtype, country
>>>
>>> Based on the WHERE clause, I would expect it to look only at partitions in
>>> the date range 2010-09-29 08:00:00 through 2010-09-30 07:00:00. But the
>>> log contains entries like:
>>>
>>> 10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file
>>> hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10
>>>
>>> And many other hours outside my WHERE constraint. I assume this means
>>> that it's processing those directories. The answer still comes out right,
>>> but I'm concerned about the performance.
>>>
>>> Would appreciate some help understanding what this means and how to fix
>>> it.
>>>
>>> Thanks,
>>> Marc
>
> Possibly you defined HF.hr <= '07' as an int column and comparing it
> as a string is resulting in a full table scan. Can you explain the
> query?

Since you defined '07' as a string, you are getting a lexicographic
comparison rather than a numeric one. That is why you are seeing more
partitions than you expect. = will work the same, but < and > will not;
a quick illustration follows.
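A minimal sketch of that difference (not from the original mail; it assumes a
Hive release new enough to accept a FROM-less SELECT, which older versions
are not):

SELECT '9' > '10';            -- true:  strings compare character by character
SELECT  9  >  10;             -- false: numbers compare by value
SELECT '07' < '10', 7 < 10;   -- true, true: zero-padded strings agree here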
You can try adding a cast in the query, or drop and re-add the partitions using a numeric type.
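A rough, untested sketch of those two options, reusing the table and query
from this thread (the hourly_fact_int name below is made up, and whether a
CAST on a partition column still allows pruning depends on the Hive version,
so re-check the plan with the "explain extended" Steven suggested):

-- Option 1: keep hr as STRING in the table, but compare it numerically in
-- the query (note the corrected >= / < boundaries for the UTC offset):
SELECT 20100929 as stamp, tagtype, country, sum(num_requests) AS num_requests
FROM hourly_fact HF
WHERE (HF.dt = '2010-09-29' AND CAST(HF.hr AS INT) >= 7)
   OR (HF.dt = '2010-09-30' AND CAST(HF.hr AS INT) <  7)
GROUP BY 20100929, tagtype, country;

-- Option 2: recreate the table with a numeric hour partition and re-add the
-- partitions, so the comparison is numeric from the start (hypothetical name):
CREATE TABLE IF NOT EXISTS hourly_fact_int (
  tagtype               STRING,
  country               STRING,
  company               INT,
  request_keyword       STRING,
  receiver_code         STRING,
  referrer_domain       STRING,
  num_requests          INT,
  num_new_user_requests INT
)
PARTITIONED BY (dt STRING, hr INT)
ROW FORMAT DELIMITED
STORED AS SEQUENCEFILE;

-- Either way, confirm that only the expected dt/hr directories show up as
-- inputs in the plan:
EXPLAIN EXTENDED
SELECT count(*)
FROM hourly_fact HF
WHERE (HF.dt = '2010-09-29' AND HF.hr >= '07')
   OR (HF.dt = '2010-09-30' AND HF.hr <  '07');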