hadoop-hive-dev mailing list archives

From "Min Zhou (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-503) improvement on distinct: distinguish distinct aggregate function from distinct
Date Fri, 22 May 2009 08:10:45 GMT

     [ https://issues.apache.org/jira/browse/HIVE-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Min Zhou updated HIVE-503:
--------------------------

    Description: 
h4.distinct
# OK
{code:sql}
select 
   distinct col
from 
  tbl
{code}
# FAILED
{code:sql}
select 
   distinct  col1,
   distinct  col2
from 
  tbl
{code}

h4.distinct aggregate function
# OK
{code:sql}
select 
   count(distinct col % 10)
from 
  tbl
{code}
# OK
{code:sql}
select 
   count(distinct col1 % 10),
   count(distinct col1 % 9)
from 
  tbl
{code}
# OK
{code:sql}
select 
   count(distinct col1 % 10),
   count(distinct col2 % 9)
from 
  tbl
{code}
# OK
{code:sql}
select 
  sum(distinct col1 % 10),
  count(distinct col2 % 9)
from 
  tbl
{code}
# OK
{code:sql}
select 
  max(distinct substr(col1, 1, 10)),
  count(distinct col2 % 9)
from 
  tbl
{code}

The keyword "distinct" often produces more than one result row, so it is impossible to remove
the duplicates of two different columns in a single mapreduce job; that is why the second query failed.
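
As a rough illustration in plain Python (not Hive; the sample rows are made up), removing duplicates from two columns independently yields two lists of different lengths, which do not line up as the rows of a single result set:

```python
# Hypothetical (col1, col2) rows standing in for tbl.
rows = [(1, "a"), (1, "b"), (2, "b"), (3, "b")]

# De-duplicate each column on its own, as "distinct col1, distinct col2" would ask for.
distinct_col1 = sorted({col1 for col1, _ in rows})
distinct_col2 = sorted({col2 for _, col2 in rows})

print(distinct_col1)  # three distinct values
print(distinct_col2)  # two distinct values
```

The two lists have three and two entries, so there is no natural way to pair them into output rows.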

But a "distinct aggregate function", written in the form aggregate_function(distinct ....),
is the counterpart of an "all aggregate function" and is essentially an aggregate
function. Each aggregate function produces only one result, so a single mapreduce
job could well handle two or more different aggregate expressions simultaneously.
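
The reasoning above can be sketched in plain Python. This is not Hive code, and the function name `distinct_aggregates` and the sample rows are hypothetical. A single pass keeps one set of seen values per distinct aggregate expression, and each expression collapses to exactly one value, so the combined output is a single row:

```python
def distinct_aggregates(rows):
    """Evaluate sum(distinct col1 % 10) and count(distinct col2 % 9) in one pass."""
    seen_col1 = set()  # distinct values of col1 % 10
    seen_col2 = set()  # distinct values of col2 % 9
    for col1, col2 in rows:
        seen_col1.add(col1 % 10)
        seen_col2.add(col2 % 9)
    # Each aggregate yields exactly one value, so the result is one row.
    return (sum(seen_col1), len(seen_col2))


# Hypothetical (col1, col2) rows standing in for tbl.
rows = [(11, 9), (21, 18), (12, 10), (3, 1)]
result = distinct_aggregates(rows)
print(result)
```

Because the two sets are independent of each other, adding a third distinct aggregate only adds another set to the same pass.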


  was:
h4.distinct
# OK
{code:sql}
select 
   col
from 
  tbl
{code}
# FAILED
{code:sql}
select 
   col1,
   col2
from 
  tbl
{code}

h4.distinct aggregate function
# OK
{code:sql}
select 
   count(distinct col % 10)
from 
  tbl
{code}
# OK
{code:sql}
select 
   count(distinct col1 % 10),
   count(distinct col1 % 9)
from 
  tbl
{code}
# OK
{code:sql}
select 
   count(distinct col1 % 10),
   count(distinct col2 % 9)
from 
  tbl
{code}
# OK
{code:sql}
select 
  sum(distinct col1 % 10),
  count(distinct col2 % 9)
from 
  tbl
{code}
# OK
{code:sql}
select 
  max(distinct substr(col1, 1, 10)),
  count(distinct col2 % 9)
from 
  tbl
{code}

The keyword "distinct" often produces more than one result row, so it is impossible to remove
the duplicates of two different columns in a single mapreduce job; that is why the second query failed.

But a "distinct aggregate function", written in the form aggregate_function(distinct ....),
is the counterpart of an "all aggregate function" and is essentially an aggregate
function. Each aggregate function produces only one result, so a single mapreduce
job could well handle two or more different aggregate expressions simultaneously.



> improvement on distinct: distinguish distinct aggregate function from distinct
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-503
>                 URL: https://issues.apache.org/jira/browse/HIVE-503
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Min Zhou

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

