hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/Tutorial" by Ning Zhang
Date Thu, 15 Apr 2010 19:44:45 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/Tutorial" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=20&rev2=21

--------------------------------------------------

  }}}
  It is assumed that the array and map fields in the input.txt files are null fields for these examples.
  
+ == Simple Query ==
+ For all the active users, one can use the query of the following form:
+ 
+ {{{
+     INSERT OVERWRITE TABLE user_active
+     SELECT user.*
+     FROM user
+     WHERE user.active = 1;
+ }}}
+ Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query on the Hive CLI:
+ 
+ {{{
+     SELECT user.*
+     FROM user
+     WHERE user.active = 1;
+ }}}
+ This will be internally rewritten to a temporary file whose contents are displayed to the Hive client.
+ 
+ == Partition Based Query ==
+ Which partitions to use in a query is determined automatically by the system on the basis of WHERE clause conditions on partition columns. For example, in order to get all the page_views in the month of March 2008 referred from the domain xyz.com, one could write the following query:
+ 
+ {{{
+     INSERT OVERWRITE TABLE xyz_com_page_views
+     SELECT page_views.*
+     FROM page_views
+     WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
+           page_views.referrer_url like '%xyz.com';
+ }}}
+ Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING); if you name your partition column something different, don't expect .date to do what you think!
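+
+ As a minimal sketch, a table definition of the sort this note assumes (the non-partition columns are illustrative, mirroring the pvs.* columns used later in this tutorial) would look like:
+
+ {{{
+     -- illustrative schema; the PARTITIONED BY clause is the one the note above relies on
+     CREATE TABLE page_views(viewTime INT, userid BIGINT,
+                     page_url STRING, referrer_url STRING,
+                     ip STRING)
+     PARTITIONED BY(date DATETIME, country STRING);
+ }}}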
+ 
+ == Joins ==
+ In order to get a demographic breakdown (by gender) of the page_view data for 2008-03-03, one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query:
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_users
+     SELECT pv.*, u.gender, u.age
+     FROM user u JOIN page_view pv ON (pv.userid = u.id)
+     WHERE pv.date = '2008-03-03';
+ }}}
+ In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query:
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_users
+     SELECT pv.*, u.gender, u.age
+     FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
+     WHERE pv.date = '2008-03-03';
+ }}}
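+ Similarly, qualifying the join with LEFT OUTER (a sketch varying the query above) preserves all rows of the user table; user rows with no matching page_view carry NULLs in the pv columns, which the WHERE condition on pv.date then filters out, just as in the full outer example:
+
+ {{{
+     INSERT OVERWRITE TABLE pv_users
+     SELECT pv.*, u.gender, u.age
+     FROM user u LEFT OUTER JOIN page_view pv ON (pv.userid = u.id)
+     WHERE pv.date = '2008-03-03';
+ }}}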
+ In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example.
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_users
+     SELECT u.*
+     FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
+     WHERE pv.date = '2008-03-03';
+ }}}
+ In order to join more than one table, the user can use the following syntax:
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_friends
+     SELECT pv.*, u.gender, u.age, f.friends
+     FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
+     WHERE pv.date = '2008-03-03';
+ }}}
+ Note that Hive only supports [[http://en.wikipedia.org/wiki/Join_(SQL)#Equi-join|equi-joins]]. Also, it is best to put the largest table on the rightmost side of the join to get the best performance.
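+
+ To illustrate the equi-join restriction, here is a sketch of a query with a non-equality join condition, which Hive will reject (it reuses the tables above):
+
+ {{{
+     SELECT pv.*, u.gender
+     FROM user u JOIN page_view pv ON (pv.userid <> u.id);  -- non-equality condition: not supported
+ }}}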
+ 
+ == Aggregations ==
+ In order to count the number of distinct users by gender one could write the following query:
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_gender_sum
+     SELECT pv_users.gender, count(DISTINCT pv_users.userid)
+     FROM pv_users
+     GROUP BY pv_users.gender;
+ }}}
+ Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns, e.g. while the following is possible:
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_gender_agg
+     SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), sum(DISTINCT pv_users.userid)
+     FROM pv_users
+     GROUP BY pv_users.gender;
+ }}}
+ however, the following query is not allowed:
+ 
+ {{{
+     INSERT OVERWRITE TABLE pv_gender_agg
+     SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
+     FROM pv_users
+     GROUP BY pv_users.gender;
+ }}}
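+ One way around this restriction (a sketch using the multi-insert form described in the next section; pv_gender_uid_agg and pv_gender_ip_agg are hypothetical target tables) is to compute each DISTINCT aggregate in its own insert clause:
+
+ {{{
+     -- the two target tables here are hypothetical and must be created first
+     FROM pv_users
+     INSERT OVERWRITE TABLE pv_gender_uid_agg
+         SELECT pv_users.gender, count(DISTINCT pv_users.userid)
+         GROUP BY pv_users.gender
+
+     INSERT OVERWRITE TABLE pv_gender_ip_agg
+         SELECT pv_users.gender, count(DISTINCT pv_users.ip)
+         GROUP BY pv_users.gender;
+ }}}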
+ == Multi Table/File Inserts ==
+ The output of the aggregations or simple selects can be further sent into multiple tables or even to Hadoop DFS files (which can then be manipulated using HDFS utilities). For example, if along with the gender breakdown one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
+ 
+ {{{
+     FROM pv_users
+     INSERT OVERWRITE TABLE pv_gender_sum
+         SELECT pv_users.gender, count(DISTINCT pv_users.userid)
+         GROUP BY pv_users.gender
+ 
+     INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
+         SELECT pv_users.age, count(DISTINCT pv_users.userid)
+         GROUP BY pv_users.age;
+ }}}
+ The first insert clause sends the results of the first group by to a Hive table while the second one sends the results to a Hadoop DFS file.
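+
+ The files written to that directory can then be examined without leaving Hive, e.g. with the following sketch (assuming your Hive CLI build supports the dfs command; the path is the one from the example above):
+
+ {{{
+     dfs -cat /user/data/tmp/pv_age_sum/*;
+ }}}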
+ 
+ == Dynamic-partition Insert ==
+ 
- In the above examples, the user has to know which partition to insert into and only one partition can be inserted in one insert statement. If you want to load into multiple partitions, you have to use multi-insert statement as illustrated below.
+ In the previous examples, the user has to know which partition to insert into, and only one partition can be inserted in one insert statement. If you want to load into multiple partitions, you have to use a multi-insert statement as illustrated below.
  {{{
      FROM page_view_stg pvs
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
-     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US') SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
+            SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
-     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA') SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
+            SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
-     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK') SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';
+            SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';
  }}}
  
  In order to load data into all country partitions of a particular day, you have to add an insert statement for each country in the input data. This is very inconvenient and inefficient since you have to have a priori knowledge of the list of countries that exist in the input data and create those partitions beforehand. If the list changes on another day, you have to modify your insert DML as well as the partition creation DDLs.
  
- ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated. This is a newly added feature that is only available from version 0.6.0 (trunk now). In the dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into. If that partition has not been created, it will create that partition automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. Below is an example of loading data to all country partitions using one insert statement:
+ ''Dynamic-partition insert'' (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0 (trunk now). In the dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into. If that partition has not been created, it will be created automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. Below is an example of loading data to all country partitions using one insert statement:
  
  {{{
- FROM page_view_stg pvs
+     FROM page_view_stg pvs
- INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
+     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) 
- SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
+            SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
  }}}
  
  There are several syntactic differences from the multi-insert statement: 
@@ -350, +460 @@

    * Another situation we want to protect against with dynamic-partition insert is the user accidentally specifying all partitions to be dynamic partitions without specifying one static partition, when the original intention is just to overwrite the sub-partitions of one root partition. We define another parameter, hive.exec.dynamic.partition.mode=strict, to prevent the all-dynamic-partition case. In strict mode, you have to specify at least one static partition. The default mode is strict. In addition, we have a parameter hive.exec.dynamic.partition=true/false to control whether to allow dynamic partitions at all. The default value is false. Both parameters can be set from the Hive CLI, as shown in the sketch after this list.
    * Currently dynamic partition insert will not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. The reason is that if either of the merge parameters is set to true, there will be a MapReduce job for each particular partition to merge all its files into one. In dynamic partition insert, we do not know the number of partitions at compile time, thus no such MapReduce jobs can be generated. There is a new JIRA, HIVE-1307, filed for this task.
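
  As a quick sketch of how the two parameters mentioned above could be set from the Hive CLI (the values here are illustrative; as noted, the defaults are hive.exec.dynamic.partition=false and strict mode):

  {{{
      -- illustrative values; the defaults are false and strict respectively
      set hive.exec.dynamic.partition=true;
      set hive.exec.dynamic.partition.mode=nonstrict;
  }}}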
  
- == Simple Query ==
- For all the active users, one can use the query of the following form:
- 
- {{{
-     INSERT OVERWRITE TABLE user_active
-     SELECT user.*
-     FROM user
-     WHERE user.active = 1;
- }}}
- Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query on Hive CLI:
- 
- {{{
-     SELECT user.*
-     FROM user
-     WHERE user.active = 1;
- }}}
- This will be internally rewritten to some temporary file and displayed to the Hive client side.
- 
- == Partition Based Query ==
- What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. For example, in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query:
- 
- {{{
-     INSERT OVERWRITE TABLE xyz_com_page_views
-     SELECT page_views.*
-     FROM page_views
-     WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
-           page_views.referrer_url like '%xyz.com';
- }}}
- Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition something different, don't expect .date to do what you think!
- 
- == Joins ==
- In order to get a demographic breakdown (by gender) of page_view of 2008-03-03 one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query:
- 
- {{{
-     INSERT OVERWRITE TABLE pv_users
-     SELECT pv.*, u.gender, u.age
-     FROM user u JOIN page_view pv ON (pv.userid = u.id)
-     WHERE pv.date = '2008-03-03';
- }}}
- In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query:
- 
- {{{
-     INSERT OVERWRITE TABLE pv_users
-     SELECT pv.*, u.gender, u.age
-     FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
-     WHERE pv.date = '2008-03-03';
- }}}
- In order check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example.
- 
- {{{
-     INSERT OVERWRITE TABLE pv_users
-     SELECT u.*
-     FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
-     WHERE pv.date = '2008-03-03';
- }}}
- In order to join more than one tables, the user can use the following syntax:
- 
- {{{
-     INSERT OVERWRITE TABLE pv_friends
-     SELECT pv.*, u.gender, u.age, f.friends
-     FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
-     WHERE pv.date = '2008-03-03';
- }}}
- Note that Hive only supports [[http://en.wikipedia.org/wiki/Join_(SQL)#Equi-join|equi-joins]]. Also it is best to put the largest table on the rightmost side of the join to get the best performance.
- 
- == Aggregations ==
- In order to count the number of distinct users by gender one could write the following query:
- 
- {{{
-     INSERT OVERWRITE TABLE pv_gender_sum
-     SELECT pv_users.gender, count (DISTINCT pv_users.userid)
-     FROM pv_users
-     GROUP BY pv_users.gender;
- }}}
- Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns .e.g while the following is possible
- 
- {{{
-     INSERT OVERWRITE TABLE pv_gender_agg
-     SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), sum(DISTINCT pv_users.userid)
-     FROM pv_users
-     GROUP BY pv_users.gender;
- }}}
- however, the following query is not allowed
- 
- {{{
-     INSERT OVERWRITE TABLE pv_gender_agg
-     SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
-     FROM pv_users
-     GROUP BY pv_users.gender;
- }}}
- == Multi Table/File Inserts ==
- The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities). e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
- 
- {{{
-     FROM pv_users
-     INSERT OVERWRITE TABLE pv_gender_sum
-         SELECT pv_users.gender, count_distinct(pv_users.userid)
-         GROUP BY pv_users.gender
- 
-     INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
-         SELECT pv_users.age, count_distinct(pv_users.userid)
-         GROUP BY pv_users.age;
- }}}
- The first insert clause sends the results of the first group by to a Hive table while the second one sends the results to a hadoop dfs files.
  
  == Inserting into local files ==
  In certain situations you would want to write the output into a local file so that you could load it into an Excel spreadsheet. This can be accomplished with the following command:
