From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/DeveloperGuide" by ZhengShao
Date Tue, 16 Dec 2008 22:52:17 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by ZhengShao:
http://wiki.apache.org/hadoop/Hive/DeveloperGuide

------------------------------------------------------------------------------
   * trunk/testutils (Deprecated)
  
  === SerDe ===
+ What is a SerDe?
+   * "SerDe" is short for Serializer and Deserializer.
+   * Hive uses a SerDe (and a FileFormat) to read rows from and write rows to tables.
+   * HDFS files --(InputFileFormat)--> <key, value> --(Deserializer)--> Row object
+   * Row object --(Serializer)--> <key, value> --(OutputFileFormat)--> HDFS files
+ 
+ Note that the "key" part is ignored when reading, and is always a constant when writing. In effect, the row object is stored only in the "value".
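+ 
+ To make the flow concrete, here is a rough sketch of the read path using the old org.apache.hadoop.mapred API. This illustrates how the pieces compose; it is not Hive's actual execution code:
+ {{{
+ import org.apache.hadoop.hive.serde2.Deserializer;
+ import org.apache.hadoop.io.Writable;
+ import org.apache.hadoop.io.WritableComparable;
+ import org.apache.hadoop.mapred.InputFormat;
+ import org.apache.hadoop.mapred.InputSplit;
+ import org.apache.hadoop.mapred.JobConf;
+ import org.apache.hadoop.mapred.RecordReader;
+ import org.apache.hadoop.mapred.Reporter;
+ 
+ public class ReadPathSketch {
+   // The InputFileFormat yields <key, value> pairs; the Deserializer turns
+   // each value into a row object.  The key is ignored on the read path.
+   public static void readRows(InputFormat<WritableComparable, Writable> inputFormat,
+       InputSplit split, JobConf job, Deserializer deserializer) throws Exception {
+     RecordReader<WritableComparable, Writable> reader =
+         inputFormat.getRecordReader(split, job, Reporter.NULL);
+     WritableComparable key = reader.createKey();
+     Writable value = reader.createValue();
+     while (reader.next(key, value)) {
+       Object row = deserializer.deserialize(value);
+       // ... hand the row (plus its ObjectInspector) to the query operators ...
+     }
+     reader.close();
+   }
+ }
+ }}}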
+ 
+ One principle of Hive is that Hive does not own the HDFS file format: users should be able to read the HDFS files behind Hive tables directly with other tools, and files written by other tools can be exposed to Hive through "CREATE EXTERNAL TABLE", or loaded with "LOAD DATA INPATH", which simply moves the file into the Hive table directory.
+ 
+ Note that org.apache.hadoop.hive.serde is the deprecated, old SerDe library. Please look at org.apache.hadoop.hive.serde2 for the latest version.
+ 
+ Existing FileFormats and SerDe classes
+   * TextInputFormat/NoKeyTextOutputFormat: These two classes read/write data in plain text file format.
+   * SequenceFileInputFormat/SequenceFileOutputFormat: These two classes read/write data in the Hadoop SequenceFile format.
+ 
+ Hive currently uses these SerDe classes to serialize and deserialize data:
+   * MetadataTypedColumnsetSerDe: This SerDe is used to read/write delimited records like CSV, tab-separated, or Ctrl-A-separated records (sorry, quoting is not supported yet).
+   * ThriftSerDe: This SerDe is used to read/write Thrift-serialized objects. The class file for the Thrift object must be loaded first.
+   * DynamicSerDe: This SerDe also reads/writes Thrift-serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports many different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records); a usage sketch follows this list.
+ 
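+ As an illustration, here is a minimal sketch of handing a Thrift DDL schema to DynamicSerDe at runtime. The property names ("name", "serialization.ddl", "serialization.format") and the sample DDL follow the serde2 test code, but treat them as assumptions and check serde2/dynamic_type for the authoritative details:
+ {{{
+ import java.util.Properties;
+ 
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe;
+ import org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol;
+ 
+ public class DynamicSerDeSketch {
+   public static void main(String[] args) throws Exception {
+     Properties tbl = new Properties();
+     // The schema is plain Thrift DDL, supplied at runtime (assumed property names):
+     tbl.setProperty("name", "page_view");
+     tbl.setProperty("serialization.ddl",
+         "struct page_view { string userid, i32 page_id }");
+     tbl.setProperty("serialization.format", TCTLSeparatedProtocol.class.getName());
+ 
+     DynamicSerDe serde = new DynamicSerDe();
+     serde.initialize(new Configuration(), tbl);
+     // serde.deserialize(...) and serde.serialize(...) now use the runtime schema.
+   }
+ }
+ }}}
+ 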
+ How to write your own SerDe:
+   * In most cases, users want to write a Deserializer instead of a full SerDe, because they only need to read their own data format rather than write it.
+   * For example, the RegexDeserializer will deserialize the data using the configuration parameter 'regex', and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe). Please see serde2/Deserializer.java for details; a minimal sketch is given after this list.
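+ 
+ As a concrete starting point, here is a minimal sketch of a custom Deserializer that splits each text line on Ctrl-A and exposes every field as a string column. The interface methods follow serde2/Deserializer.java as referenced above; the ObjectInspector factory calls and the "columns" table property are assumptions to verify against the serde2 sources:
+ {{{
+ import java.util.ArrayList;
+ import java.util.Arrays;
+ import java.util.List;
+ import java.util.Properties;
+ 
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.hive.serde2.Deserializer;
+ import org.apache.hadoop.hive.serde2.SerDeException;
+ import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+ import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+ import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+ import org.apache.hadoop.io.Writable;
+ 
+ public class SimpleSplitDeserializer implements Deserializer {
+ 
+   private List<String> columnNames;
+   private ObjectInspector inspector;
+   private final ArrayList<String> row = new ArrayList<String>();
+ 
+   public void initialize(Configuration conf, Properties tbl) throws SerDeException {
+     // Hive passes the declared column names through the table properties
+     // (assumed property name "columns").
+     columnNames = Arrays.asList(tbl.getProperty("columns").split(","));
+     List<ObjectInspector> fieldInspectors = new ArrayList<ObjectInspector>();
+     for (int i = 0; i < columnNames.size(); i++) {
+       fieldInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
+     }
+     inspector = ObjectInspectorFactory.getStandardStructObjectInspector(
+         columnNames, fieldInspectors);
+   }
+ 
+   public Object deserialize(Writable blob) throws SerDeException {
+     // "blob" is the value produced by the InputFileFormat; the key was dropped.
+     String[] fields = blob.toString().split("\001");
+     row.clear();
+     for (int i = 0; i < columnNames.size(); i++) {
+       row.add(i < fields.length ? fields[i] : null);
+     }
+     return row;
+   }
+ 
+   public ObjectInspector getObjectInspector() throws SerDeException {
+     return inspector;
+   }
+ }
+ }}}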
+ 
  === MetaStore ===
+ 
  === Query Processor ===
  The following are the main components of the Hive Query Processor:
  * Parse and SemanticAnalysis (ql/parse) - This component contains the code for parsing SQL, converting it into Abstract Syntax Trees, converting the Abstract Syntax Trees into operator plans, and finally converting the operator plans into a directed graph of tasks, which are executed by Driver.java.
