hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zheng Shao (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HIVE-474) Support for distinct selection on two or more columns
Date Wed, 03 Jun 2009 20:44:07 GMT

    [ https://issues.apache.org/jira/browse/HIVE-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716057#action_12716057
] 

Zheng Shao edited comment on HIVE-474 at 6/3/09 1:42 PM:
---------------------------------------------------------

There are several approaches to solve this problem:

A1: separate group-by and join the results.

{code}
SELECT COALESCE(t1.key, t2.key), COALESCE(d_a, 0), , COALESCE(d_b, 0)
FROM
(SELECT key, count(distinct a) as d_a ...) t1
OUTER JOIN
(SELECT key, count(distinct b) as d_b ...) t2
ON t1.key = t2.key
{code}

A2: Take advantage of union type (HIVE-537).
See HIVE-537 for details.

A3: Take advantage of partitioned merge join:
Here is a different plan. It depends on partitioned merge join.
Also the 2 jobs have to have the same number of reducers.

{code}
Query:
  select a, count(distinct b), count(distinct c), sum(d)

Plan:
  Job1:
    Map side:
      Emit: distribution_key: a, sort_key: a, b, value: d
      Save a, c to temp_file1
    Reduce side:
      Group By:
        a, count(distinct b), sum(d)
    Output: temp_file2
  Job 2: Input: temp_file1
    Map side:
      Emit: distribution_key: a, sort_key: a, c, value: nothing
    Reduce side:
      Group By
        a, count(distinct c)
      Partitioned Merge Join with temp_file2 on a
        a, count(distinct b), sum(d), count(distinct c)
{code}


      was (Author: zshao):
    There are several approaches to solve this problem:

A1: separate group-by and join the results.

{code}
SELECT COALESCE(t1.key, t2.key), COALESCE(d_a, 0), , COALESCE(d_b, 0)
FROM
(SELECT key, count(distinct a) as d_a ...) t1
OUTER JOIN
(SELECT key, count(distinct b) as d_b ...) t2
ON t1.key = t2.key
{code}

A2: Take advantage of union type (HIVE-537).
See HIVE-537 for details.

A3: Take advantage of partitioned merge join:
Here is a different plan. It depends on partitioned merge join.
Also the 2 jobs have to have the same 

{code}
Query:
  select a, count(distinct b), count(distinct c), sum(d)

Plan:
  Job1:
    Map side:
      Emit: distribution_key: a, sort_key: a, b, value: d
      Save a, c to temp_file1
    Reduce side:
      Group By:
        a, count(distinct b), sum(d)
    Output: temp_file2
  Job 2: Input: temp_file1
    Map side:
      Emit: distribution_key: a, sort_key: a, c, value: nothing
    Reduce side:
      Group By
        a, count(distinct c)
      Partitioned Merge Join with temp_file2 on a
        a, count(distinct b), sum(d), count(distinct c)
{code}

  
> Support for distinct selection on two or more columns
> -----------------------------------------------------
>
>                 Key: HIVE-474
>                 URL: https://issues.apache.org/jira/browse/HIVE-474
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Alexis Rondeau
>
> The ability to select distinct several, individual columns as by example: 
> select count(distinct user), count(distinct session) from actions;   
> Currently returns the following failure: 
> FAILED: Error in semantic analysis: line 2:7 DISTINCT on Different Columns not Supported
user

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message