hadoop-hive-dev mailing list archives

From "John Sichi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-287) count distinct on multiple columns does not work
Date Thu, 17 Jun 2010 23:15:26 GMT

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879994#action_12879994 ]

John Sichi commented on HIVE-287:
---------------------------------

For DISTINCT: we can check the function invocation itself (during semantic analysis) by
calling supportsDistinct() immediately after instantiating the GenericUDAFEvaluator in SemanticAnalyzer.
This allows strict validation to be performed. Alternatively, name the method checkDistinct and
let the UDAF throw the exception itself. But I agree that in this case it would be cleaner
to extend the interface, so I'm fine if we go ahead with that in a non-breaking fashion.
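
A rough sketch of the first option, using stand-in classes rather than the real Hive ones. The supportsDistinct() hook is the one proposed above; its default of false, the evaluator subclasses, and the DistinctCheck helper are assumptions made for illustration only.

```java
// Stand-in for Hive's GenericUDAFEvaluator; hypothetical, not the real class.
abstract class GenericUDAFEvaluator {
    // Proposed hook. Defaulting to false is an assumption of this sketch.
    boolean supportsDistinct() {
        return false;
    }
}

// Hypothetical evaluator that opts in to DISTINCT.
final class CountLikeEvaluator extends GenericUDAFEvaluator {
    @Override
    boolean supportsDistinct() {
        return true;
    }
}

// Hypothetical evaluator that keeps the default and so rejects DISTINCT.
final class PlainEvaluator extends GenericUDAFEvaluator {
}

// Stand-in for the check the semantic analyzer would run right after
// instantiating the evaluator.
final class DistinctCheck {
    static void validate(GenericUDAFEvaluator evaluator, boolean isDistinct, String functionName) {
        if (isDistinct && !evaluator.supportsDistinct()) {
            throw new IllegalArgumentException("DISTINCT is not supported for " + functionName);
        }
    }
}
```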

For COUNT(*):  if you think about it, COUNT(*) really means "ignore all columns" not "count
all columns".  So I think an empty array actually makes a lot of sense here. Can you think
of a case where UDAF(*) even makes sense, where UDAF != COUNT?  If you don't have access to
any per-row data, what can you do other than count it?  I'd say we should actually disallow
* for anything but COUNT, per the SQL standard.
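
To make the rule concrete, here is a minimal sketch of how the argument list for a star invocation could be resolved. StarArguments and its resolve method are hypothetical names, and column arguments are represented as plain strings to keep the example self-contained.

```java
// Hypothetical helper: decides which arguments an aggregate call receives.
final class StarArguments {
    static String[] resolve(String functionName, boolean isStar, String[] explicitArgs) {
        if (!isStar) {
            // Ordinary invocation: pass the explicit arguments through.
            return explicitArgs.clone();
        }
        if (!"count".equalsIgnoreCase(functionName)) {
            // UDAF(*) is rejected for everything except COUNT.
            throw new IllegalArgumentException("* is only allowed with COUNT, not " + functionName);
        }
        // COUNT(*) means "ignore all columns", so it gets an empty array.
        return new String[0];
    }
}
```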

I like your approach to keeping compatibility via instanceof, so if the decision ends up being
to add the extra parameters, then we should definitely use that approach.  However, extension
points should always be interfaces (not abstract classes) to allow for stuff like dynamic
proxies.  So we would need to add a new interface GenericUDAFResolver2 (extends GenericUDAFResolver)
with the new method, and make AbstractGenericUDAFResolver implement both.
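
Roughly, the interface layering plus the instanceof dispatch could look like the sketch below. TypeInfo and Evaluator are stand-ins for the real Hive types, and the extra boolean parameters on the new method and the ResolverDispatch helper are assumptions, since the final signature has not been decided.

```java
// Stand-in for Hive's TypeInfo; hypothetical, only for this sketch.
final class TypeInfo {
    final String name;
    TypeInfo(String name) { this.name = name; }
}

// Stand-in for the evaluator; it only records how it was created.
final class Evaluator {
    final String description;
    Evaluator(String description) { this.description = description; }
}

// The existing extension point, left untouched so old plugins keep working.
interface GenericUDAFResolver {
    Evaluator getEvaluator(TypeInfo[] parameters);
}

// The new extension point. The two boolean parameters are assumed here.
interface GenericUDAFResolver2 extends GenericUDAFResolver {
    Evaluator getEvaluator(TypeInfo[] parameters, boolean isDistinct, boolean isAllColumns);
}

// Base class implementing both interfaces; by default the new method
// ignores the extra flags and delegates to the old one.
abstract class AbstractGenericUDAFResolver implements GenericUDAFResolver2 {
    @Override
    public Evaluator getEvaluator(TypeInfo[] parameters, boolean isDistinct, boolean isAllColumns) {
        return getEvaluator(parameters);
    }
}

// Hypothetical call site: use the new method only when the plugin has it.
final class ResolverDispatch {
    static Evaluator resolve(GenericUDAFResolver resolver, TypeInfo[] parameters,
                             boolean isDistinct, boolean isAllColumns) {
        if (resolver instanceof GenericUDAFResolver2) {
            return ((GenericUDAFResolver2) resolver).getEvaluator(parameters, isDistinct, isAllColumns);
        }
        return resolver.getEvaluator(parameters);
    }
}
```

Because both extension points are interfaces, a plugin can still be supplied as a dynamic proxy or a lambda rather than a subclass.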

Interface evolution is never pretty, but there is an interface design pattern which avoids
this particular problem.  Imagine if originally we had defined a GenericUDAFResolverInput
class inside of Hive itself, with a method getParameters() returning TypeInfo[].  Hive would
instantiate this and pass an input object into getEvaluator, and the evaluator would call
input.getParameters().  This would have allowed us to add a boolean isDistinct() method to
GenericUDAFResolverInput without breaking anything (source or binary) and without needing
to add a new interface; old plugins would not know about isDistinct() so they wouldn't call
it, and new ones could.
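
A self-contained sketch of that pattern, again with stand-in types. The Resolver interface, the two plugin classes, and the use of a String in place of a real evaluator are assumptions made to keep the example short.

```java
// Stand-in for Hive's TypeInfo; hypothetical, only for this sketch.
final class TypeInfo {
    final String name;
    TypeInfo(String name) { this.name = name; }
}

// Owned by Hive, not by plugins, so methods can be added without breaking anyone.
final class GenericUDAFResolverInput {
    private final TypeInfo[] parameters;
    private final boolean distinct;

    GenericUDAFResolverInput(TypeInfo[] parameters, boolean distinct) {
        this.parameters = parameters.clone();
        this.distinct = distinct;
    }

    // Present from the first version of the class.
    TypeInfo[] getParameters() { return parameters.clone(); }

    // Added in a later version; old plugins simply never call it.
    boolean isDistinct() { return distinct; }
}

// The plugin interface never changes: it always takes the input object.
// A String stands in for the evaluator here.
interface Resolver {
    String getEvaluator(GenericUDAFResolverInput input);
}

// Hypothetical old plugin, written before isDistinct() existed.
final class LegacyPlugin implements Resolver {
    @Override
    public String getEvaluator(GenericUDAFResolverInput input) {
        return "legacy evaluator over " + input.getParameters().length + " argument(s)";
    }
}

// Hypothetical new plugin that uses the added method to reject DISTINCT.
final class StrictPlugin implements Resolver {
    @Override
    public String getEvaluator(GenericUDAFResolverInput input) {
        if (input.isDistinct()) {
            throw new IllegalArgumentException("DISTINCT is not supported");
        }
        return "strict evaluator over " + input.getParameters().length + " argument(s)";
    }
}
```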

I would argue that if we're going to go to the trouble of adding GenericUDAFResolver2, then
we should build the pattern above into it as well in case we need further evolution later
on.

p.s. I'm really glad you're working on this one...every few days I try a count(*) against
Hive accidentally and then kick myself.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

