hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/GettingStarted" by ZhengShao
Date Mon, 07 Dec 2009 08:56:04 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/GettingStarted" page has been changed by ZhengShao.
http://wiki.apache.org/hadoop/Hive/GettingStarted?action=diff&rev1=27&rev2=28

--------------------------------------------------

  This streams the data in the map phase through the script /bin/cat (like hadoop streaming).

  Similarly, streaming can be used on the reduce side (see the Hive Tutorial for examples).
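As a sketch of what a reduce-side transform looks like (not part of the original page; the table names `t` and `t_out` and the use of `/bin/cat` as a trivial reducer are illustrative), rows are first clustered by key so each reducer sees complete groups:

```sql
-- Illustrative sketch: cluster rows by key, then stream each group
-- through a reducer script (/bin/cat just echoes its input).
FROM (
  FROM t
  MAP t.key, t.value
  USING '/bin/cat'
  AS key, value
  CLUSTER BY key
) map_output
INSERT OVERWRITE TABLE t_out
REDUCE map_output.key, map_output.value
USING '/bin/cat'
AS key, value;
```

The `CLUSTER BY` clause is what guarantees all rows with the same key reach the same reducer invocation.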
  
+ 
+ == Example Use Cases ==
+ 
+ === MovieLens User Ratings ===
+ First, create a table with tab-delimited text file format:
+ {{{
+ CREATE TABLE u_data (
+   userid INT,
+   movieid INT,
+   rating INT,
+   unixtime STRING)
+ ROW FORMAT DELIMITED
+ FIELDS TERMINATED BY '\t'
+ STORED AS TEXTFILE;
+ }}}
+ 
+ Then, download and extract the data files:
+ {{{
+ wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
+ tar xvzf ml-data.tar__0.gz
+ }}}
+  
+ And load it into the table that was just created:
+ {{{
+ LOAD DATA LOCAL INPATH 'ml-data/u.data'
+ OVERWRITE INTO TABLE u_data;
+ }}}
+ 
+ Count the number of rows in table u_data:
+ {{{
+ SELECT COUNT(1) FROM u_data;
+ }}}
+ 
+ Now we can do some complex data analysis on the table u_data:
+ 
+ Create weekday_mapper.py:
+ {{{
+ import sys
+ import datetime
+ 
+ # Read tab-separated rows from stdin and replace the unixtime column
+ # with the ISO weekday (1 = Monday ... 7 = Sunday).
+ for line in sys.stdin:
+   line = line.strip()
+   userid, movieid, rating, unixtime = line.split('\t')
+   weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
+   print('\t'.join([userid, movieid, rating, str(weekday)]))
+ }}}
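As a quick local sanity check of the mapper's per-line logic (a sketch; the sample values below are made up, not taken from the real dataset), the transformation can be exercised outside Hive:

```python
import datetime

def map_line(line):
    # Same logic as weekday_mapper.py, applied to one line.
    # Note: fromtimestamp() uses the local timezone, so the exact
    # weekday can differ between machines near day boundaries.
    userid, movieid, rating, unixtime = line.strip().split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return '\t'.join([userid, movieid, rating, str(weekday)])

# Hypothetical input row: userid, movieid, rating, unixtime.
print(map_line("196\t242\t3\t881250949"))
```

The output keeps the first three columns unchanged and replaces the timestamp with a weekday in the range 1-7.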
+ 
+ Use the mapper script:
+ {{{
+ CREATE TABLE u_data_new (
+   userid INT,
+   movieid INT,
+   rating INT,
+   weekday INT)
+ ROW FORMAT DELIMITED
+ FIELDS TERMINATED BY '\t';
+ 
+ INSERT OVERWRITE TABLE u_data_new
+ SELECT
+   TRANSFORM (userid, movieid, rating, unixtime)
+   USING 'python weekday_mapper.py'
+   AS (userid, movieid, rating, weekday)
+ FROM u_data;
+ 
+ SELECT weekday, COUNT(1)
+ FROM u_data_new
+ GROUP BY weekday;
+ }}}
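When Hive runs this job on an actual cluster, the script must also be shipped to the worker nodes so that `python weekday_mapper.py` can find it. Hive's `ADD FILE` command does this; run it before the `INSERT` above (path shown is relative to the CLI working directory):

```sql
-- Ship the mapper script to the cluster's distributed cache.
ADD FILE weekday_mapper.py;
```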
+ 
+ === Apache Weblog Data ===
+ 
+ The format of Apache weblogs is customizable, but most webmasters use the default.
+ For the default Apache weblog format, we can create a table with the following command.
+ 
+ More about !RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662
+ 
+ {{{
+ add jar ../build/contrib/hive_contrib.jar;
+ 
+ CREATE TABLE apachelog (
+   host STRING,
+   identity STRING,
+   user STRING,
+   time STRING,
+   request STRING,
+   status STRING,
+   size STRING,
+   referer STRING,
+   agent STRING)
+ ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
+ WITH SERDEPROPERTIES (
+   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*)
(-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
+   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
+ )
+ STORED AS TEXTFILE;
+ }}}
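Once the table is defined, ordinary HiveQL works over the parsed fields. For instance (an illustrative query, not from the original page), requests can be tallied by HTTP status code:

```sql
-- Count requests per HTTP status code in the parsed weblog table.
SELECT status, COUNT(1)
FROM apachelog
GROUP BY status;
```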
+ 
