hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "SequenceFile" by Arun C Murthy
Date Thu, 17 Aug 2006 05:13:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by Arun C Murthy:

  == SequenceFile Formats ==
+ This section describes the format for the latest ''''version 4'''' of !SequenceFiles.
  Essentially there are 3 different file formats for !SequenceFiles, depending on whether ''compression'' and ''block compression'' are active.
  However, all of the above formats share a common ''header'' (which is used by the !SequenceFile.Reader to return the appropriate key/value pairs). The next section summarises the header:
  ===== SequenceFile Common Header =====
-  * version - A byte array: SEQ<version no.>
+  * version - A byte array: 3 bytes of magic header ''''SEQ'''', followed by 1 byte of actual version no. (e.g. SEQ4)
   * keyClassName - String
   * valueClassName - String
   * compression - A boolean which specifies if ''compression'' is turned on for keys/values in this file.
   * blockCompression - A boolean which specifies if ''block compression'' is turned on for keys/values in this file.
   * sync - A sync marker to denote end of the header.
+ All strings are serialized using the [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/Text.html#writeString(java.io.DataOutput,%20java.lang.String) Text.writeString] API.
+ [[BR]]
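The header fields listed above can be read back in order. The following Python sketch is illustrative only and rests on several assumptions that the page does not state: the version is a raw byte value (4, not the ASCII digit), each boolean occupies one byte, the sync marker is 16 bytes long, and Text.writeString emits a variable-length byte count followed by UTF-8 bytes. The names `read_string`, `read_header` and `SYNC_SIZE` are hypothetical.

```python
import io
import struct

SYNC_SIZE = 16  # assumption: the page does not give the sync marker length


def read_string(stream):
    # Assumption: a string is a variable-length byte count followed by UTF-8
    # bytes. Only the one-byte form of the count (0..127) is handled here.
    length = struct.unpack("b", stream.read(1))[0]
    if length < 0:
        raise ValueError("multi-byte length prefixes are not handled in this sketch")
    return stream.read(length).decode("utf-8")


def read_header(stream):
    # 3 bytes of magic followed by 1 byte of version number.
    if stream.read(3) != b"SEQ":
        raise ValueError("missing SEQ magic bytes")
    version = stream.read(1)[0]
    key_class = read_string(stream)
    value_class = read_string(stream)
    # Assumption: each boolean is a single byte, zero meaning false.
    compression = stream.read(1) != b"\x00"
    block_compression = stream.read(1) != b"\x00"
    sync = stream.read(SYNC_SIZE)
    return {
        "version": version,
        "keyClassName": key_class,
        "valueClassName": value_class,
        "compression": compression,
        "blockCompression": block_compression,
        "sync": sync,
    }
```

Calling `read_header(io.BytesIO(data))` on the first bytes of a file returns a dictionary keyed by the field names used in the list above. A real reader should rely on !SequenceFile.Reader instead of parsing bytes by hand.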
  The formats for Uncompressed/!RecordCompressed Writers are very similar:
  ===== Uncompressed/RecordCompressed Writer Format =====
   * [#SeqFileHeader Header]
   * Record
+    * Record length
+    * Key length
     * Key
     * (Compressed?) Value
   * A sync-marker every few kilobytes or so.
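As a rough illustration of the record layout above, the sketch below writes and reads a single record. It assumes, without confirmation from this page, that both lengths are 4-byte big-endian integers (as Java's DataOutput.writeInt would produce) and that the record length counts the key bytes plus the value bytes but not the two length fields. Sync markers and value compression are left out, and `write_record` and `read_record` are hypothetical names.

```python
import io
import struct


def write_record(out, key, value):
    # Record length, then key length, then the raw key and value bytes.
    # Assumption: record length = len(key) + len(value).
    out.write(struct.pack(">ii", len(key) + len(value), len(key)))
    out.write(key)
    out.write(value)


def read_record(stream):
    record_length, key_length = struct.unpack(">ii", stream.read(8))
    key = stream.read(key_length)
    # Whatever remains of the record after the key is the value.
    value = stream.read(record_length - key_length)
    return key, value


buf = io.BytesIO()
write_record(buf, b"k1", b"value-1")
buf.seek(0)
key, value = read_record(buf)
```

Storing the key length separately lets a reader skip straight to the value, or read only the key, without knowing anything about how either was serialized.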
@@ -57, +64 @@

     * !CompressedValuesBlockSize
     * !CompressedValuesBlock
+  The compressed blocks of ''key lengths'' and ''value lengths'' consist of the actual lengths of individual keys/values encoded in ZeroCompressedInteger format.
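The page does not define the ZeroCompressedInteger format. The Python sketch below assumes it is the variable-length encoding used by Hadoop's WritableUtils.writeVInt: values from -112 to 127 take one byte, and anything else takes a marker byte that encodes the sign and byte count, followed by the magnitude in big-endian order. The function names are hypothetical.

```python
import io
import struct


def write_vint(value):
    # Small values are stored directly in a single signed byte.
    if -112 <= value <= 127:
        return struct.pack("b", value)
    marker = -112
    if value < 0:
        value = ~value  # store the one's complement of negative values
        marker = -120
    payload = value.to_bytes((value.bit_length() + 7) // 8, "big")
    # The marker byte encodes both the sign and the payload length.
    return struct.pack("b", marker - len(payload)) + payload


def read_vint(stream):
    first = struct.unpack("b", stream.read(1))[0]
    if first >= -112:
        return first
    negative = first < -120
    size = (-120 - first) if negative else (-112 - first)
    value = int.from_bytes(stream.read(size), "big")
    return ~value if negative else value


def encode_lengths(lengths):
    # A block of key (or value) lengths is the encoded lengths back to back.
    return b"".join(write_vint(n) for n in lengths)
```

Because most keys and values are short, most lengths fit in one byte, which keeps the length blocks small even before the block itself is compressed.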
