hadoop-hive-dev mailing list archives

From "Arvind Prabhakar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-287) count distinct on multiple columns does not work
Date Thu, 17 Jun 2010 22:34:25 GMT

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879983#action_12879983 ]

Arvind Prabhakar commented on HIVE-287:

@John: Thanks for reviewing this change. I have some follow-up comments and suggestions:

bq. isDistinct: this doesn't actually modify the choice of evaluator implementation at all,
since the actual duplicate elimination takes place upstream of the UDAF invocation. So instead
of adding this parameter, can we instead add a new method supportsDistinct() on GenericUDAFEvaluator?

While the duplicate elimination may happen upstream, that does not rule out cases where
this information is relevant to the function invocation itself. For example, the
implementation of {{count}} requires that a valid argument list, if present, be qualified
with {{DISTINCT}}.

bq. isAllColumns: COUNT is probably the only function which is ever even going to care about
this one. Couldn't we just use an empty array of TypeInfo to indicate all columns?

I had a similar idea, but after some consideration opted for a simpler design. I felt that
overloading arguments to indicate special cases might lead to confusion, and to eventual
problems when a use case emerges that invalidates this assumption.
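
To make the concern concrete, here is a minimal sketch comparing the two conventions. It is only an illustration: {{AllColumnsFlagSketch}}, {{ParamInfo}} and both helper methods are hypothetical names, parameter types are modelled as plain strings rather than TypeInfo, and none of this is the code in the patch.

```java
import java.util.Collections;
import java.util.List;

public class AllColumnsFlagSketch {

    /** Hypothetical parameter description; not the class used in the patch. */
    public static final class ParamInfo {
        final List<String> paramTypes;
        final boolean isAllColumns;

        public ParamInfo(List<String> paramTypes, boolean isAllColumns) {
            this.paramTypes = paramTypes;
            this.isAllColumns = isAllColumns;
        }
    }

    /** Convention A: an empty parameter list is overloaded to mean "all columns". */
    public static boolean isAllColumnsByConvention(List<String> paramTypes) {
        return paramTypes.isEmpty();
    }

    /** Convention B: the caller states the special case explicitly. */
    public static boolean isAllColumnsExplicit(ParamInfo info) {
        return info.isAllColumns;
    }

    public static void main(String[] args) {
        List<String> noParams = Collections.emptyList();

        // A call written as f(*) and a call written as f() both carry no parameter types.
        ParamInfo star = new ParamInfo(noParams, true);
        ParamInfo zeroArgs = new ParamInfo(noParams, false);

        // Convention A cannot tell the two calls apart.
        System.out.println(isAllColumnsByConvention(star.paramTypes));
        System.out.println(isAllColumnsByConvention(zeroArgs.paramTypes));

        // Convention B keeps them distinct.
        System.out.println(isAllColumnsExplicit(star));
        System.out.println(isAllColumnsExplicit(zeroArgs));
    }
}
```

A function that legitimately takes zero arguments would be one use case that breaks the empty-array convention, which is the kind of situation the explicit flag avoids.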

I do agree with your point that it will be good to stay compatible if possible. One way to
do it would be as follows:

# Revert the {{GenericUDAFResolver}} interface to its previous state, but deprecate it
in favor of the abstract base class.
# Push the newly introduced method into the {{AbstractGenericUDAFResolver}} implementation.
# Modify the {{FunctionRegistry.getGenericUDAFEvaluator()}} method to test whether the resolver
instance is type compatible with {{AbstractGenericUDAFResolver}} and, if so, invoke the new method.
Otherwise fall back to the old mechanism.
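
The three steps above could look roughly like the following. This is a simplified sketch, not the patch itself: the type names mirror the ones mentioned above, but every signature is an assumption (evaluators are represented as plain strings, parameter types as a list of strings), and {{LegacyResolver}} and {{CountLikeResolver}} are hypothetical examples.

```java
import java.util.Arrays;
import java.util.List;

public class ResolverDispatchSketch {

    /** Step 1: stand-in for the original interface, left unchanged but deprecated. */
    @Deprecated
    public interface GenericUDAFResolver {
        String getEvaluator(List<String> paramTypes);
    }

    /** Step 2: stand-in for the abstract base class that carries the new method. */
    public abstract static class AbstractGenericUDAFResolver implements GenericUDAFResolver {
        // Assumed signature. The default delegates to the old method, so subclasses
        // that do not care about the extra flags need no changes.
        public String getEvaluator(List<String> paramTypes, boolean isDistinct,
                                   boolean isAllColumns) {
            return getEvaluator(paramTypes);
        }
    }

    /** Step 3: stand-in for the registry lookup, dispatching on the resolver's type. */
    public static String getGenericUDAFEvaluator(GenericUDAFResolver resolver,
                                                 List<String> paramTypes,
                                                 boolean isDistinct,
                                                 boolean isAllColumns) {
        if (resolver instanceof AbstractGenericUDAFResolver) {
            return ((AbstractGenericUDAFResolver) resolver)
                    .getEvaluator(paramTypes, isDistinct, isAllColumns);
        }
        // Old mechanism: resolvers written against the interface keep working.
        return resolver.getEvaluator(paramTypes);
    }

    /** Hypothetical resolver written against the deprecated interface only. */
    public static class LegacyResolver implements GenericUDAFResolver {
        @Override
        public String getEvaluator(List<String> paramTypes) {
            return "legacy:" + paramTypes.size();
        }
    }

    /** Hypothetical resolver that uses the extra information. */
    public static class CountLikeResolver extends AbstractGenericUDAFResolver {
        @Override
        public String getEvaluator(List<String> paramTypes) {
            return "count";
        }

        @Override
        public String getEvaluator(List<String> paramTypes, boolean isDistinct,
                                   boolean isAllColumns) {
            return isDistinct ? "count-distinct" : "count";
        }
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("string", "int");
        System.out.println(getGenericUDAFEvaluator(new LegacyResolver(), cols, true, false));
        System.out.println(getGenericUDAFEvaluator(new CountLikeResolver(), cols, true, false));
    }
}
```

With this arrangement, existing third-party resolvers that implement only the interface are unaffected, while resolvers extending the base class can opt in to the new information.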

What do you think about this approach?

> count distinct on multiple columns does not work
> ------------------------------------------------
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
> The following query does not work:
> select count(distinct col1, col2) from Tbl
