hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/LanguageManual/Transform" by PaulYang
Date Fri, 08 Jan 2010 02:05:44 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/Transform" page has been changed by PaulYang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform?action=diff&rev1=14&rev2=15

--------------------------------------------------

  
  Users can also plug in their own custom mappers and reducers in the data stream by using
features natively supported in the Hive language. For example, in order to run a custom mapper
script - map_script - and a custom reducer script - reduce_script - the user can issue the
following command, which uses the TRANSFORM clause to embed the mapper and the reducer scripts.
  
- Note that columns will be transformed to ''STRING'' and delimited by TAB before feeding
to the user script, and the standard output of the user script will be treated as TAB-separated
''STRING'' columns. User scripts can output debug information to standard error which will
be shown on the task detail page on hadoop.
+ By default, columns will be transformed to ''STRING'' and delimited by TAB before being fed
to the user script, and the standard output of the user script will be treated as TAB-separated
''STRING'' columns. User scripts can output debug information to standard error, which will
be shown on the task detail page on Hadoop. These defaults can be overridden with ''ROW FORMAT''...
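
  For concreteness, here is a minimal sketch of what such a user script might look like, written
in Python purely for illustration; the script behaviour, column handling, and debug message are
assumptions rather than part of the original page.
  {{{
  #!/usr/bin/env python
  # Hypothetical map_script: reads TAB-delimited STRING columns from stdin and
  # writes TAB-delimited columns to stdout. Anything written to standard error
  # appears on the Hadoop task detail page.
  import sys

  for line in sys.stdin:
      cols = line.rstrip("\n").split("\t")              # input columns arrive as strings
      sys.stderr.write("saw %d columns\n" % len(cols))  # debug output
      print("\t".join(cols))                            # emit the output columns
  }}}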
  
  In the syntax, both ''MAP ...'' and ''REDUCE ...'' can also be written as ''SELECT TRANSFORM
( ... )''.  There is actually no difference between these three.
  Hive runs the reduce script in the reduce task (instead of the map task) because of the
''clusterBy''/''distributeBy''/''sortBy'' clause in the inner query.
@@ -20, +20 @@

  distributeBy: DISTRIBUTE BY colName (',' colName)*
  sortBy: SORT BY colName (ASC | DESC)? (',' colName (ASC | DESC)?)*
  
+ rowFormat
+   : ROW FORMAT
+     (DELIMITED [FIELDS TERMINATED BY char] 
+                [COLLECTION ITEMS TERMINATED BY char]
+                [MAP KEYS TERMINATED BY char]
+      | 
+      SERDE serde_name [WITH SERDEPROPERTIES 
+                             property_name=property_value, 
+                             property_name=property_value, ...])
+ 
+ outRowFormat : rowFormat
+ inRowFormat : rowFormat
+ outRecordReader : RECORDREADER className
+ 
  query:
    FROM (
      FROM src
      MAP expression (',' expression)*
+     (inRowFormat)?
      USING 'my_map_script'
+     (outRowFormat)? (outRecordReader)?
      ( AS colName (',' colName)* )?
      ( clusterBy? | distributeBy? sortBy? ) src_alias
    )
    REDUCE expression (',' expression)*
+     (inRowFormat)?
      USING 'my_reduce_script'
+     (outRowFormat)? (outRecordReader)?
      ( AS colName (',' colName)* )?
  
    FROM (
      FROM src
      SELECT TRANSFORM '(' expression (',' expression)* ')'
+     (inRowFormat)?
      USING 'my_map_script'
+     (outRowFormat)? (outRecordReader)?
      ( AS colName (',' colName)* )?
      ( clusterBy? | distributeBy? sortBy? ) src_alias
    )
    SELECT TRANSFORM '(' expression (',' expression)* ')'
+     (inRowFormat)? 
      USING 'my_reduce_script'
+     (outRowFormat)? (outRecordReader)?
      ( AS colName (',' colName)* )?
  }}}
  
- Example:
+ Example #1:
  {{{
    FROM (
      FROM pv_users
@@ -68, +90 @@

      AS date, count;
  }}}
  
+ Example #2:
+ {{{
+   FROM (
+     FROM src
+     SELECT TRANSFORM(src.key, src.value) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
+     USING '/bin/cat'
+     AS (tkey, tvalue) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
+     RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader'
+   ) tmap
+   INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue
+ }}}
+ 
  == Schema-less Map-reduce Scripts ==
  If there is no ''AS'' clause after ''USING my_script'', Hive assumes that the output of the
script contains two parts: the key, which is everything before the first tab, and the value,
which is everything after the first tab. Note that this is different from specifying
''AS key, value'', because in that case ''value'' will only contain the portion between the
first and the second tab if there are multiple tabs.
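
  To make the distinction concrete, here is a plain Python sketch (illustrative only, not part of
the original page) of how the same script output line is split in each case:
  {{{
  line = "k\tv1\tv2"                        # script output containing two tabs

  # Schema-less: the value is everything after the first tab
  key, value = line.split("\t", 1)          # key = "k", value = "v1\tv2"

  # With AS key, value: the value is only the field between the first and second tab
  fields = line.split("\t")                 # ["k", "v1", "v2"]
  key2, value2 = fields[0], fields[1]       # key2 = "k", value2 = "v1"
  }}}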
  
