hadoop-hive-dev mailing list archives

From "Ashish Thusoo (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-503) improvement on distinct: distinguish distinct aggregate function from distinct
Date Fri, 22 May 2009 17:27:45 GMT

    [ https://issues.apache.org/jira/browse/HIVE-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712177#action_12712177 ]

Ashish Thusoo commented on HIVE-503:
------------------------------------

Actually, we talked about this approach a long time back, but we were not sure that it
would be better than running 2 map/reduce jobs. The reason is that this approach leads
to a sort of mn amount of data, where m is the number of distincts and n the number of rows,
as opposed to a sort of m+n data if we do this with m map/reduce jobs. Granted, we also
scan the data mn times in the second approach as opposed to 1 time in the first approach, but
we find in our cluster that scan bandwidth is not an issue (mostly because we store data compressed),
and the sort and the memory used in the reducer or the mapper become the issue. I think this
does call for some experimentation to determine the value of m at which one approach becomes
better than the other.
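
To make the "sort of mn amount of data" concrete, here is a minimal Python sketch of the single-job approach. It only simulates the idea and is not Hive's implementation: the function name, the tagging scheme, and the toy rows are all assumptions made for illustration. Each row is emitted once per distinct expression, so the one sort sees m * n records.

```python
from itertools import groupby

def single_job_distinct_counts(rows, exprs):
    """Count distinct values of each expression in one simulated map/reduce pass."""
    # Map: emit one (expression index, value) record per row per expression.
    emitted = [(i, f(row)) for row in rows for i, f in enumerate(exprs)]
    # Shuffle: a single sort over all m * n emitted records.
    emitted.sort()
    # Reduce: duplicates are now adjacent, so count the unique keys per tag.
    counts = [0] * len(exprs)
    for (i, _value), _group in groupby(emitted):
        counts[i] += 1
    return counts, len(emitted)

# Hypothetical toy table: n = 500 rows, m = 2 distinct expressions.
rows = [{"col1": c1, "col2": c2} for c1 in range(100) for c2 in range(5)]
exprs = [lambda r: r["col1"] % 10, lambda r: r["col2"] % 9]

counts, sorted_records = single_job_distinct_counts(rows, exprs)
print(counts, sorted_records)
```

With m = 2 expressions over n = 500 rows, the sort handles 1000 records. That growth with m is the cost the comment weighs against scanning the input once per expression.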


> improvement on distinct: distinguish distinct aggregate function from distinct
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-503
>                 URL: https://issues.apache.org/jira/browse/HIVE-503
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Min Zhou
>
> h4.distinct
> # OK
> {code:sql}
> select 
>    distinct col
> from 
>   tbl
> {code}
> # FAILED
> {code:sql}
> select 
>    distinct  col1,
>    distinct  col2
> from 
>   tbl
> {code}
> h4.distinct aggregate function
> # OK
> {code:sql}
> select 
>    count(distinct col % 10)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>    count(distinct col1 % 10),
>    count(distinct col1 % 9)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>    count(distinct col1 % 10),
>    count(distinct col2 % 9)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>   sum(distinct col1 % 10),
>   count(distinct col2 % 9)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>   max(distinct substr(col1, 1, 10)),
>   count(distinct col2 % 9)
> from 
>   tbl
> {code}
> The keyword "distinct" often produces more than one result row, so it is impossible to remove
> duplicates from two different columns in only one mapreduce job; that is why the query fails.
> But the term "distinct aggregate function", in a form like aggregate_function(distinct ....),
> is the counterpart of the term "all aggregate function"; it is essentially an aggregate
> function. Each aggregate function produces only one result, so it is quite possible for one
> mapreduce job to handle two or more different aggregate expressions simultaneously.
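
The difference in result shape described above can be illustrated with a small Python sketch. The toy rows and variable names are assumptions, not anything from Hive. Deduplicating two columns separately yields lists of different lengths that do not line up into rows, while each distinct aggregate collapses to a single value, so several of them fit into one output row.

```python
# Hypothetical toy table with columns (col1, col2).
rows = [(1, "a"), (2, "a"), (3, "b"), (3, "b")]

# "select distinct col": one output row per distinct value, so the two
# columns produce result sets of different sizes.
distinct_col1 = sorted({r[0] for r in rows})
distinct_col2 = sorted({r[1] for r in rows})

# "count(distinct col)": each aggregate yields exactly one value, so both
# results can be returned together as a single row.
aggregate_row = (len(distinct_col1), len(distinct_col2))

print(distinct_col1, distinct_col2, aggregate_row)
```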

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

