hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "SequenceFile" by Arun C Murthy
Date Wed, 16 Aug 2006 10:12:48 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by Arun C Murthy:

The comment on the change is:
First Cut

New page:
== Overview ==

SequenceFile is a flat file consisting of binary key/value pairs. It is used extensively in
MapReduce as an input/output format.
It is also worth noting that the ''output'' of the Map is always a SequenceFile.

SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively.

There are 3 different !SequenceFile formats:
 1. Uncompressed key/value records.
 2. Record compressed key/value records - only 'values' are compressed here.
 3. Block compressed key/value records - both keys and values are collected in 'blocks' separately
and compressed.

The recommended way is to use the SequenceFile.createWriter methods to construct the 'preferred'
writer implementation.

The [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.Reader.html
SequenceFile.Reader] acts as a bridge and can read any of the above SequenceFile formats.

== SequenceFile Formats ==

Essentially there are 3 different file formats for !SequenceFiles depending on whether ''compression''
and ''block compression'' are active.

However, all of the above formats share a common ''header'' (which is used by the !SequenceFile.Reader
to return the appropriate key/value pairs). The next section summarises the header:
[[Anchor(SeqFileHeader)]]
===== SequenceFile Common Header =====
 * version - A byte array: SEQ<version no.>
 * keyClassName - String
 * valueClassName - String
 * compression - A boolean which specifies if ''compression'' is turned on for keys/values
in this file.
 * blockCompression - A boolean which specifies if ''block compression'' is turned on for
keys/values in this file.
 * sync - A sync marker to denote end of the header.
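
The header fields listed above can be modelled as a small Java class. This is only a minimal sketch: the class name `SeqHeaderSketch`, the use of `writeUTF` for the class-name strings, the 16-byte sync length and the `example.*` class names are assumptions made for illustration, and should not be read as the actual on-disk encoding used by Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative model of the common header fields (not Hadoop's real encoding).
class SeqHeaderSketch {
    static final int SYNC_SIZE = 16; // assumed sync marker length

    final byte version;
    final String keyClassName;
    final String valueClassName;
    final boolean compression;
    final boolean blockCompression;
    final byte[] sync;

    SeqHeaderSketch(byte version, String keyClassName, String valueClassName,
                    boolean compression, boolean blockCompression, byte[] sync) {
        this.version = version;
        this.keyClassName = keyClassName;
        this.valueClassName = valueClassName;
        this.compression = compression;
        this.blockCompression = blockCompression;
        this.sync = sync;
    }

    // Serialise the fields in the order given in the list above.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.write(new byte[] {'S', 'E', 'Q', version}); // SEQ<version no.>
            out.writeUTF(keyClassName);   // assumption: modified UTF-8 string
            out.writeUTF(valueClassName);
            out.writeBoolean(compression);
            out.writeBoolean(blockCompression);
            out.write(sync);              // sync marker ends the header
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parse a header written by toBytes().
    static SeqHeaderSketch fromBytes(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte[] magic = new byte[4];
            in.readFully(magic);
            if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
                throw new IllegalArgumentException("not a SEQ header");
            }
            String key = in.readUTF();
            String value = in.readUTF();
            boolean compression = in.readBoolean();
            boolean blockCompression = in.readBoolean();
            byte[] sync = new byte[SYNC_SIZE];
            in.readFully(sync);
            return new SeqHeaderSketch(magic[3], key, value, compression, blockCompression, sync);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Which of the three formats the two booleans select.
    String format() {
        if (!compression) {
            return "uncompressed";
        }
        return blockCompression ? "block-compressed" : "record-compressed";
    }

    public static void main(String[] args) {
        byte[] sync = new byte[SYNC_SIZE];
        new java.util.Random(42).nextBytes(sync);
        SeqHeaderSketch h = new SeqHeaderSketch((byte) 3, "example.KeyType",
                "example.ValueType", true, false, sync);
        SeqHeaderSketch back = fromBytes(h.toBytes());
        System.out.println(back.keyClassName + " / " + back.valueClassName + " / " + back.format());
    }
}
```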

The formats for Uncompressed/!RecordCompressed Writers are very similar:
===== Uncompressed/RecordCompressed Writer Format =====
 * [#SeqFileHeader Header]
 * Record
   * Key
   * (Compressed?) Value
 * A sync-marker every 100 bytes or so, which lets a reader seek to a random point in the file and
then resynchronise at the next ''record''.
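
To illustrate how a sync marker helps with random seeks, the sketch below scans forward from an arbitrary offset until it finds the marker and returns the position immediately after it. The class name `SyncScanSketch`, the method name and the 4-byte marker used in `main` are hypothetical; the real reader's logic may well differ.

```java
import java.util.Arrays;

// Illustrative resynchronisation after seeking to a random offset.
class SyncScanSketch {

    // Returns the offset of the first byte after the next sync marker found
    // at or beyond 'from', or -1 if no marker follows.
    static int nextRecordStart(byte[] file, int from, byte[] sync) {
        for (int i = Math.max(from, 0); i + sync.length <= file.length; i++) {
            byte[] window = Arrays.copyOfRange(file, i, i + sync.length);
            if (Arrays.equals(window, sync)) {
                return i + sync.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sync = {(byte) 0xAA, (byte) 0xBB, (byte) 0xCC, (byte) 0xDD};
        byte[] file = new byte[26];
        Arrays.fill(file, (byte) 1); // stand-in for record bytes
        System.arraycopy(sync, 0, file, 10, sync.length);
        // Seek to an arbitrary offset, then skip ahead to the next record.
        System.out.println(nextRecordStart(file, 3, sync));
    }
}
```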

The format for the !BlockCompressedWriter is as follows:
===== BlockCompressed Writer Format =====
 * [#SeqFileHeader Header]
 * Record ''Block''
   * !CompressedKeyLengthsBlockSize
   * !CompressedKeyLengthsBlock
   * !CompressedKeysBlockSize
   * !CompressedKeysBlock
   * !CompressedValueLengthsBlockSize
   * !CompressedValueLengthsBlock
   * !CompressedValuesBlockSize
   * !CompressedValuesBlock
   * A sync-marker, which lets a reader seek to a random point in the file and then resynchronise
at the next ''record block''.
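
The record block layout above can be sketched as four compressed buffers, each preceded by its compressed size, followed by the sync marker. In this sketch the sizes are written as 4-byte integers and the buffers are compressed with `java.util.zip`; both choices, and the class name `RecordBlockSketch`, are assumptions for illustration rather than a description of the actual writer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Illustrative layout of one record block (not Hadoop's real encoding).
class RecordBlockSketch {

    static byte[] deflate(byte[] raw) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DeflaterOutputStream z = new DeflaterOutputStream(buf)) {
                z.write(raw);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] inflate(byte[] compressed) {
        try (InflaterInputStream z =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = z.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // sections: key lengths, keys, value lengths, values (in that order).
    static byte[] writeBlock(byte[][] sections, byte[] sync) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            for (byte[] section : sections) {
                byte[] z = deflate(section);
                out.writeInt(z.length); // compressed block size
                out.write(z);           // compressed block
            }
            out.write(sync);            // sync marker closes the record block
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back the four sections written by writeBlock().
    static byte[][] readBlock(byte[] block) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(block));
            byte[][] sections = new byte[4][];
            for (int i = 0; i < sections.length; i++) {
                byte[] z = new byte[in.readInt()];
                in.readFully(z);
                sections[i] = inflate(z);
            }
            return sections;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[][] sections = {
            {3, 5}, "foohello".getBytes(StandardCharsets.UTF_8),
            {1, 2}, "xyz".getBytes(StandardCharsets.UTF_8)
        };
        byte[] block = writeBlock(sections, new byte[16]);
        System.out.println(new String(readBlock(block)[1], StandardCharsets.UTF_8));
    }
}
```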
