hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hive/Tutorial" by NeilConway
Date Thu, 02 Apr 2009 00:40:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by NeilConway:
http://wiki.apache.org/hadoop/Hive/Tutorial

------------------------------------------------------------------------------
  || BIGINT || count(1), count(DISTINCT col [, col]...)|| count(1) returns the number of members
in the group, whereas the count(DISTINCT col) gets the count of distinct values of the columns
in the group||
  || DOUBLE || sum(col), sum(DISTINCT col) || returns the sum of the elements in the group
or the sum of the distinct values of the column in the group||
  || DOUBLE || avg(col), avg(DISTINCT col) || returns the average of the elements in the group
or the average of the distinct values of the column in the group||
- || DOUBLE || min(col) || returns the minimum of the column in the group||
+ || DOUBLE || min(col) || returns the minimum value of the column in the group||
- || DOUBLE || max(col) || returns the maximum value of the column n the group||
+ || DOUBLE || max(col) || returns the maximum value of the column in the group||
  
  == Language capabilities ==
  Hive query language provides the basic SQL like operations. These operations work on tables
or partitions. These operations are:
@@ -223, +223 @@

              LINES TERMINATED BY '12' 
      STORED AS SEQUENCEFILE; 
  }}}
- In the example above, the table is bucketed(clustered by) userid and within each bucket
the data is sorted in the increasing order of viewTime. Such an organization allows the user
to do efficient sampling on the clustered column - in this case userid. The sorting property
allows internal operators to take advantage of the better-known data structure while evaluating
queries, also increasing efficiency.
+ In the example above, the table is bucketed (clustered by) userid and within each bucket
the data is sorted in the increasing order of viewTime. Such an organization allows the user
to do efficient sampling on the clustered column - in this case userid. The sorting property
allows internal operators to take advantage of the better-known data structure while evaluating
queries, also increasing efficiency.
  {{{    
      CREATE TABLE page_view(viewTime INT, userid BIGINT,
                      page_url STRING, referrer_url STRING, 
@@ -341, +341 @@

  (Note that page_views.date is used here because the table (above) was defined with PARTITIONED
BY(date DATETIME, country STRING) ; if you name your partition something different, don't
expect .date to do what you think!) 
  
  == Joins ==
- In order to get a demographic breakdown(by gender) of page_view of 2008-03-03 one would
need to join the page_view table and the user table on the userid column. This can be accomplished
with a join as shown in the following query: 
+ In order to get a demographic breakdown (by gender) of page_view of 2008-03-03 one would
need to join the page_view table and the user table on the userid column. This can be accomplished
with a join as shown in the following query: 
  {{{     
      INSERT OVERWRITE TABLE pv_users 
      SELECT pv.*, u.gender, u.age 

Mime
View raw message