hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/HiveQL/GroupBy" by ZhengShao
Date Wed, 21 Jan 2009 23:10:56 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by ZhengShao:
http://wiki.apache.org/hadoop/Hive/HiveQL/GroupBy

New page:

== Group By Syntax ==
{{{

}}}

== Simple Examples ==
In order to count the number of distinct users by gender one could write the following query:
{{{
  INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count (DISTINCT pv_users.userid)
  FROM pv_users
  GROUP BY pv_users.gender;
}}}

Multiple aggregations can be done at the same time, however, no two aggregations can have
different DISTINCT columns .e.g while the following is possible

{{{
  INSERT OVERWRITE TABLE pv_gender_agg
  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), sum(DISTINCT pv_users.userid)
  FROM pv_users
  GROUP BY pv_users.gender;
}}}

However, the following query is not allowed.  We don't allow multiple DISTINCT expressions
in the same query.

{{{
  INSERT OVERWRITE TABLE pv_gender_agg
  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
  FROM pv_users
  GROUP BY pv_users.gender;
}}}


== Advanced Features ==

=== Multi-Group-By Inserts ===
The output of the aggregations or simple selects can be further sent into multiple tables
or even to hadoop dfs files (which can then be manipulated using hdfs utilitites). e.g. if
along with the gender breakdown, one needed to find the breakdown of unique page views by
age, one could accomplish that with the following query:

{{{
  FROM pv_users 
  INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count_distinct(pv_users.userid) 
    GROUP BY pv_users.gender 
  INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
    SELECT pv_users.age, count_distinct(pv_users.userid) 
    GROUP BY pv_users.age; 
}}}


=== Map-side Aggregation for Group By ===
''hive.map.aggr'' controls how we do aggregations.  The default is false.  If it is set to
true, Hive will do the first-level aggregation directly in the map task.
This usually provides better efficiency, but may require more memory to run successfully.

{{{
  set hive.map.aggr=true;
  SELECT COUNT(1) FROM table2;
}}}

Mime
View raw message