hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "SequenceFile" by DougCutting
Date Wed, 16 Aug 2006 18:51:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by DougCutting:
http://wiki.apache.org/lucene-hadoop/SequenceFile

The comment on the change is:
a few clarifications

------------------------------------------------------------------------------
  == Overview ==
  
  SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
- It is also worth noting the the ''output'' of the Map is always a SequenceFile.
+ It is also worth noting that, internally, the temporary outputs of maps are stored using SequenceFile.
  
  The SequenceFile provides a Writer, Reader and Sorter classes for writing, reading and sorting respectively.
  
@@ -38, +38 @@

   * Record
     * Key
     * (Compressed?) Value
-  * A sync-marker every 100bytes or so to help in seeking to a random point in the file and then seeking to next ''record''.
+  * A sync-marker every few k bytes or so. 
+ 
+ The sync marker permits seeking to a random point in a file and then re-synchronizing input with record boundaries.  This is required to be able to efficiently split large files for MapReduce processing.
  
  [[BR]]
  The format for the !BlockCompressedWriter is as follows:
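
The sync-marker paragraph in the hunk above (not the !BlockCompressedWriter layout, which this excerpt cuts off) can be illustrated with a minimal Python sketch. It assumes a toy record layout, not the real SequenceFile on-disk format: the names `SYNC`, `SYNC_INTERVAL`, `write_records` and `read_from` are hypothetical, the marker is a fixed 16-byte value, and the interval is kept tiny so the effect is visible with a handful of records.

```python
import hashlib
import io
import struct

# Hypothetical 16-byte marker for this toy format (assumption, not Hadoop's value).
SYNC = hashlib.md5(b"toy-sync-marker").digest()
# Toy interval in bytes; the wiki text only says "every few k bytes or so".
SYNC_INTERVAL = 64


def write_records(records, interval=SYNC_INTERVAL):
    """Serialize (key, value) byte pairs, emitting SYNC roughly every `interval` bytes."""
    out = io.BytesIO()
    last_sync = 0
    for key, value in records:
        if out.tell() - last_sync >= interval:
            out.write(SYNC)
            last_sync = out.tell()
        # Toy record: 4-byte key length, 4-byte value length, then the raw bytes.
        out.write(struct.pack(">II", len(key), len(value)))
        out.write(key)
        out.write(value)
    return out.getvalue()


def read_from(data, offset):
    """Yield records found after the first sync marker at or after `offset`."""
    start = data.find(SYNC, offset)
    if start < 0:
        return  # no marker after this point, so nothing to re-synchronize on
    pos = start + len(SYNC)
    while pos < len(data):
        if data.startswith(SYNC, pos):
            pos += len(SYNC)  # skip later markers between records
            continue
        klen, vlen = struct.unpack_from(">II", data, pos)
        pos += 8
        key = data[pos:pos + klen]
        value = data[pos + klen:pos + klen + vlen]
        pos += klen + vlen
        yield key, value


records = [(b"k%03d" % i, b"v" * 20) for i in range(20)]
data = write_records(records)
# Jump to an arbitrary byte offset, as a split reader would, and resume cleanly.
tail = list(read_from(data, len(data) // 2))
```

Every record in `tail` is a complete record from the original sequence, even though the starting offset was chosen without regard to record boundaries. In this sketch, records that precede the first marker after the offset are left to whichever reader owns the preceding byte range. The toy reader also tells markers from record headers by comparing bytes, which a real format would need to make unambiguous.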
