hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cam Bazz <camb...@gmail.com>
Subject Re: why this query gives wrong results
Date Thu, 24 Feb 2011 06:51:25 GMT
Hello,

Here are the table descriptions. they only have the identifier, hits,
unqiques and date_day which is the partition

hive> describe selection_daily_hits;
OK
sel_sid int
hits    int
date_day        string

hive> describe selection_daily_uniques;
OK
sel_sid int
uniques int
date_day        string

hive> describe selection_daily_clicks;
OK
sel_sid int
clicks  int
date_day        string

i tried to query you have, which unfortunately returns:

FAILED: Parse Error: line 1:272 cannot recognize input 'ON' in
expression specification

In the query I wrote, I thought the on clause covered all three joins,
but I from your query how i should do it from your query, except, it
will return parse error...? but why?

best regards,
c,b,


On Wed, Feb 23, 2011 at 10:31 PM, Viral Bajaria <viral.bajaria@gmail.com> wrote:
>
> I have a few questions as follows:
> 1) what's the schema of all 3 tables ? Do these tables only have  (sel_sid,
> date_day) as the columns along with the facts that they represent or do they
> have more columns besides those 2 columns ?
> 2) why do you do a left outer join without an ON clause, not too sure if
> MySql syntax is like that but I normally don't prefer to do a JOIN without
> specifying anything in the ON clause. I just don't trust the result set and
> the query is not really readable.
> How about trying this query:
> SELECT
>    h.sel_sid, h.hits, u.uniques, if(c.clicks is not null, c.clicks, 0)
> FROM
>    selection_daily_hits h
>    left outer join selection_daily_uniques u ON (h.sel_sid = u.sel_sid AND
> h.date_day = u.date_day AND u.date_day = '20110211')
>    left outer join selection_daily_clicks c on ON (h.sel_sid = c.sel_sid AND
> h.date_day = c.date_day AND c.date_day = '20110211')
> WHERE h.date_day = '20110211';
> some notes about the query:
> - I restrict the uniques and clicks to 20110211 in the JOIN clause because
> in hive 0.5.0 if you put them in the WHERE clause the partitions don't get
> trimmed and it scans the entire table before limiting the data. If your data
> is not partitioned please go ahead and remove that restriction.
> - I join on the date_day columns to make sure the data is correct if the
> tables are not partitioned or the query plan causes table scans because
> there are chances you can see the same sel_sid on different days (this is an
> assumption)
> -Viral
> On Wed, Feb 23, 2011 at 3:16 AM, Cam Bazz <cambazz@gmail.com> wrote:
>>
>> Hello,
>>
>> I have three tables, one that counts hits, the other unique visits,
>> and the other clicks on that page:
>>
>> The query below will fail to produce correct results: (number of
>> uniques is wrong, always set to 8, same number for all)
>>
>> select h.sel_sid, h.hits, u.uniques, if(c.clicks is not null,
>> c.clicks, 0) from selection_daily_hits h left outer join
>> selection_daily_uniques u left outer join selection_daily_clicks c on
>> (h.sel_sid = u.sel_sid and h.sel_sid = c.sel_sid and h.date_day =
>> '20110211' and u.date_day = '20110211' and c.date_day = '20110211');
>>
>> where the query below will work and provide correct results
>>
>> select h.sel_sid, h.hits, u.uniques, if(c.clicks is not null,
>> c.clicks, 0) from selection_daily_hits h left outer join
>> selection_daily_uniques u left outer join selection_daily_clicks c on
>> (h.sel_sid = u.sel_sid and c.sel_sid = h.sel_sid and h.date_day =
>> '20110211' and u.date_day = '20110211' and c.date_day = '20110211');
>>
>> the only difference is, on the non working query I have h.sel_sid =
>> c.sel_sid and in the working query I have c.sel_sid = h.sel_sid
>>
>> notice that while the first and second table will always have the same
>> number keys, the third table might not have some keys, hence those
>> lines are converted to 0.
>>
>> Best Regards,
>> -C.B.
>
>

Mime
View raw message