hadoop-hive-dev mailing list archives

From "John Sichi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-287) count distinct on multiple columns does not work
Date Thu, 08 Jul 2010 19:46:52 GMT

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886428#action_12886428 ]

John Sichi commented on HIVE-287:
---------------------------------

Regarding DISTINCT:  I agree with Arvind; this information should be provided to the UDAF
so that it can reject invocations that don't make sense.  Once this validation is passed,
the distinct elimination is still implemented generically inside of Hive (upstream of the
UDAF).
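
To make the division of labor concrete, here is a minimal sketch in Python (not Hive code). Every name in it (Invocation, InvalidInvocation, validate_single_arg_distinct, accept_any, run_aggregate) is hypothetical and exists only for illustration. It assumes the UDAF is handed a flag saying whether DISTINCT was used, may reject the call, and that duplicate elimination then happens in the engine before the aggregate sees any rows. NULL handling is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """Hypothetical description of one aggregate call, as shown to the UDAF."""
    name: str
    arg_count: int
    is_distinct: bool


class InvalidInvocation(Exception):
    """Raised when a UDAF rejects the way it was invoked."""


def validate_single_arg_distinct(inv):
    # Hypothetical policy: this aggregate's author decided that DISTINCT
    # only makes sense with exactly one argument.
    if inv.is_distinct and inv.arg_count != 1:
        raise InvalidInvocation(
            f"{inv.name}: DISTINCT requires exactly one argument"
        )


def accept_any(inv):
    # Hypothetical policy for an aggregate that accepts DISTINCT over
    # any number of columns (the behavior requested for COUNT here).
    return None


def run_aggregate(inv, validate, aggregate, rows):
    # Step 1: the UDAF gets the chance to reject the invocation.
    validate(inv)
    # Step 2: distinct elimination is generic and happens upstream of the
    # aggregate; dict.fromkeys drops duplicate argument tuples in order.
    if inv.is_distinct:
        rows = list(dict.fromkeys(rows))
    # Step 3: the aggregate itself never needs to know about DISTINCT.
    return aggregate(rows)


rows = [(1, "a"), (1, "a"), (2, "b")]
distinct_count = run_aggregate(Invocation("count", 2, True), accept_any, len, rows)
```

In this sketch the aggregate body (here just len) is identical for the DISTINCT and non-DISTINCT cases; only the validation hook sees the flag.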

Regarding F(*):  let's distinguish three cases.

COUNT(*):  this really means COUNT(), not COUNT(x,y,z).  This is a very important distinction
to make from an optimizer perspective, because we want to be able to push down projection
to avoid I/O and other processing for columns whose values we will never look at.

SUM(*) and similar ones:  these we should disallow.

MY_UDAF(*), or MY_UDAF(t.*):  this is similar to Pradeep's case that came up recently on the
mailing list, and it needs to expand to MY_UDAF(x,y,z), not MY_UDAF().  I think the patch
is currently doing MY_UDAF(), which isn't what he wants.
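
The three cases can be sketched as a single expansion step. This is a Python illustration only, not the semantic analyzer's actual code: expand_star and DISALLOW_STAR are hypothetical names, the set of "similar" functions that reject the star is an assumption, and the qualified form t.* is left out to keep the sketch short.

```python
# Assumption: which built-ins count as "SUM(*) and similar ones".
DISALLOW_STAR = {"SUM", "AVG", "MIN", "MAX"}


def expand_star(func_name, args, table_columns):
    """Return the argument list for a call after resolving a lone '*'."""
    if list(args) != ["*"]:
        # No star involved: arguments pass through unchanged.
        return list(args)

    name = func_name.upper()
    if name == "COUNT":
        # COUNT(*) means COUNT(): no column is referenced, so projection
        # pushdown can still prune every column.
        return []
    if name in DISALLOW_STAR:
        # SUM(*) and similar calls are rejected outright.
        raise ValueError(f"{name}(*) is not allowed")
    # Any other function: '*' expands to all columns, e.g. MY_UDAF(x, y, z).
    return list(table_columns)


columns = ["x", "y", "z"]
count_args = expand_star("COUNT", ["*"], columns)
udaf_args = expand_star("MY_UDAF", ["*"], columns)
```

Under these assumptions count_args comes out empty while udaf_args holds all three column names; after expansion a UDAF that expects a fixed number of parameters can still reject the call on arity alone.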

My recommendation is that we commit Arvind's patch as is, then create a followup JIRA issue
to do what Pradeep is looking for (the expansion of * in the semantic analyzer) for both UDF
and UDAF, but with a special case for COUNT. UDAF authors will be able to decide whether or
not to reject the star syntax, since in the common case of a UDAF expecting a limited number
of parameters, the star won't make sense.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

