hadoop-hive-dev mailing list archives

From "Arvind Prabhakar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-287) count distinct on multiple columns does not work
Date Thu, 08 Jul 2010 15:02:55 GMT

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886339#action_12886339 ]

Arvind Prabhakar commented on HIVE-287:
---------------------------------------

@Zheng: Welcome to the party.

bq. Why do we put the DISTINCT in the information? DISTINCT is currently done by the framework,
instead of individual UDAF. This is good because the logic of removing duplicates are common
for all UDAFs. We do support SUM(DISTINCT val).

Providing the information in the parameter specification is not the same as enforcing its
interpretation. It is provided primarily so that UDAFs that rely on this information
can make appropriate decisions. For example, we wanted to disallow the invocation
{{COUNT(EXPR1, EXPR2 ...)}} in favor of {{COUNT(*DISTINCT* EXPR1, EXPR2 ...)}}. Without this
information, the count UDAF will not be able to enforce the latter syntax.
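
To make the point concrete, here is a minimal Java sketch of a count UDAF checking the DISTINCT flag when it validates an invocation. {{ParameterInfo}} and {{validateCount}} are hypothetical names made up for this illustration; they are not Hive classes and not the API in the attached patches. The framework can still do the de-duplication itself; the flag is only consulted to decide whether the call is legal.

```java
import java.util.Arrays;
import java.util.List;

public class CountResolverSketch {

    /** Hypothetical description of one UDAF invocation (not a Hive class). */
    static final class ParameterInfo {
        final List<String> argumentTypes; // one entry per argument expression
        final boolean distinct;           // true for COUNT(DISTINCT ...)

        ParameterInfo(List<String> argumentTypes, boolean distinct) {
            this.argumentTypes = argumentTypes;
            this.distinct = distinct;
        }
    }

    /** Rejects COUNT(EXPR1, EXPR2 ...) unless DISTINCT was specified. */
    static void validateCount(ParameterInfo info) {
        if (info.argumentTypes.isEmpty()) {
            throw new IllegalArgumentException("COUNT requires at least one argument");
        }
        if (info.argumentTypes.size() > 1 && !info.distinct) {
            throw new IllegalArgumentException(
                "COUNT with multiple arguments requires DISTINCT");
        }
    }

    public static void main(String[] args) {
        // COUNT(DISTINCT col1, col2): accepted.
        validateCount(new ParameterInfo(Arrays.asList("string", "int"), true));

        // COUNT(col1, col2): rejected, because the evaluator can see the flag is unset.
        try {
            validateCount(new ParameterInfo(Arrays.asList("string", "int"), false));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If the flag were not passed down, both calls in {{main}} would look identical to the evaluator.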

bq. Why do we special-case "\*"? It seems to me that "\*" is just a short-cut. Hive already supports
regex-based multi-column specification, so that we can say `abc.*` for all columns with name
starting with abc. The compiler should just expand * and give all the columns to the UDAF.

If you wish to use \* as a regular expression, you would have to quote it as a string - {{COUNT('\*')}}.
This is different from the invocation as specified in SQL which treats \* as a terminal symbol.
So if it is OK to deviate from the standard representation, the user can easily use the quoted
string representation to achieve an effect similar to {{COUNT(col1, col2 ...)}}. The semantics
of this should be more like {{COUNT(DISTINCT EXPR1, EXPR2 ...)}} as opposed to {{COUNT(\*)}}.

bq. Since COUNT(\*) is a special-case in the SQL standard (COUNT(\*) is different from COUNT(col)
even if the table has a single column col), I think we should just special-case that and replace
that with count(1) at some place.

Are you suggesting that we allow the grammar to express {{COUNT(\*)}} syntax, but in the lexical
analysis stage turn it into a {{COUNT(1)}}? I can see how that may work - but personally I
am not a fan of such an approach. 
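
For reference, the semantic difference under discussion can be sketched in a few lines of Java. This is only an illustration over in-memory rows, not Hive code; the class and method names are hypothetical, and the rule that a multi-column DISTINCT count skips any tuple containing a NULL is an assumption based on how SQL aggregates usually treat NULL.

```java
import java.util.Arrays;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CountSemanticsSketch {

    /** COUNT(*): every row counts, regardless of NULLs. COUNT(1) gives the same
     *  result because the constant 1 is never NULL. */
    static long countStar(List<Object[]> rows) {
        return rows.size();
    }

    /** COUNT(col): only rows where the column is not NULL count. */
    static long countColumn(List<Object[]> rows, int column) {
        long n = 0;
        for (Object[] row : rows) {
            if (row[column] != null) {
                n++;
            }
        }
        return n;
    }

    /** COUNT(DISTINCT c1, c2 ...): distinct tuples, skipping tuples with a NULL
     *  (assumed NULL handling, see the note above). */
    static long countDistinct(List<Object[]> rows, int... columns) {
        Set<List<Object>> seen = new HashSet<>();
        for (Object[] row : rows) {
            List<Object> tuple = new ArrayList<>();
            boolean hasNull = false;
            for (int c : columns) {
                if (row[c] == null) {
                    hasNull = true;
                    break;
                }
                tuple.add(row[c]);
            }
            if (!hasNull) {
                seen.add(tuple);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
            new Object[] {"a", 1},
            new Object[] {null, 1},
            new Object[] {"a", 1},
            new Object[] {"b", null});

        System.out.println(countStar(rows));           // 4
        System.out.println(countColumn(rows, 0));      // 3
        System.out.println(countDistinct(rows, 0, 1)); // 1
    }
}
```

The first two results differ even when only column 0 is considered, which is why {{COUNT(\*)}} cannot simply be expanded into a count over the table's columns.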

> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch,
>                      HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

