lucene-java-commits mailing list archives

From gsing...@apache.org
Subject svn commit: r480702 - /lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml
Date Wed, 29 Nov 2006 20:10:46 GMT
Author: gsingers
Date: Wed Nov 29 12:10:44 2006
New Revision: 480702

URL: http://svn.apache.org/viewvc?view=rev&rev=480702
Log:
Readded the lock-less commits documentation that was accidentally lost.

Modified:
    lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml

Modified: lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml?view=diff&rev=480702&r1=480701&r2=480702
==============================================================================
--- lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml (original)
+++ lucene/java/trunk/src/site/src/documentation/content/xdocs/fileformats.xml Wed Nov 29 12:10:44 2006
@@ -1,38 +1,39 @@
 <?xml version="1.0"?>
 
 <document>
-	<header>
+    <header>
         <title>
-Apache Lucene - Index File Formats
-		</title>
-	</header>
-  <properties>
-   
-   <authors>
-    <person email="cutting@apache.org" name="Doug Cutting"/>
-   </authors>
-  </properties>
+            Apache Lucene - Index File Formats
+        </title>
+    </header>
+
+    <properties>
+        <authors>
+            <person email="cutting@apache.org" name="Doug Cutting"/>
+        </authors>
+    </properties>
 
     <body>
-        <section id="Index File Formats">
-            <title>Index File Formats</title>
+        <section id="Index File Formats"><title>Index File Formats</title>
+
             <p>
                 This document defines the index file formats used
-                in Lucene version 2.0.  If you are using a different
-		version of Lucene, please consult the copy of
-		<code>docs/fileformats.html</code> that was distributed
-		with the version you are using.
+                in Lucene version 2.1. If you are using a different
+                version of Lucene, please consult the copy of
+                <code>docs/fileformats.html</code>
+                that was distributed
+                with the version you are using.
             </p>
 
             <p>
                 Apache Lucene is written in Java, but several
                 efforts are underway to write
                 <a href="http://wiki.apache.org/jakarta-lucene/LuceneImplementations">versions
-                of Lucene in other programming
+                    of Lucene in other programming
                 languages</a>.  If these versions are to remain compatible with Apache
                 Lucene, then a language-independent definition of the Lucene index
                 format is required.  This document thus attempts to provide a
-                complete and independent definition of the Apache Lucene 1.4 file
+                complete and independent definition of the Apache Lucene 2.1 file
                 formats.
             </p>
 
@@ -47,10 +48,22 @@
                 describing how file formats have changed from prior versions.
             </p>
 
+            <p>
+                In version 2.1, the file format was changed to allow
+                lock-less commits (i.e., no more commit lock). The
+                change is fully backwards compatible: you can open a
+                pre-2.1 index for searching or adding/deleting of
+                docs. When the new segments file is saved
+                (committed), it will be written in the new file format
+                (meaning no specific "upgrade" process is needed).
+                But note that once a commit has occurred, pre-2.1
+                Lucene will not be able to read the index.
+            </p>
+
         </section>
 
-        <section id="Definitions">
-            <title>Definitions</title>
+        <section id="Definitions"><title>Definitions</title>
+
             <p>
                 The fundamental concepts in Lucene are index,
                 document, field and term.
@@ -86,8 +99,8 @@
                 within the field.
             </p>
 
-            <section id="Inverted Indexing">
-                <title>Inverted Indexing</title>
+            <section id="Inverted Indexing"><title>Inverted Indexing</title>
+
                 <p>
                     The index stores statistics about terms in order
                     to make term-based search more efficient.  Lucene's
@@ -114,18 +127,20 @@
                 <p>See the <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html">Field</a> java docs for more information on Fields.</p>
             </section>
 
-            <section id="Segments">
-                <title>Segments</title>
+            <section id="Segments"><title>Segments</title>
+
                 <p>
-                    Lucene indexes may be composed of multiple sub-indexes, or<i>
-                        segments</i>. Each segment is a fully independent index, which could be searched
+                    Lucene indexes may be composed of multiple sub-indexes, or
+                    <i>segments</i>. Each segment is a fully independent index, which could be searched
                     separately. Indexes evolve by:
                 </p>
 
                 <ol>
-                    <li><p>Creating new segments for newly added documents.</p>
+                    <li>
+                        <p>Creating new segments for newly added documents.</p>
                     </li>
-                    <li><p>Merging existing segments.</p>
+                    <li>
+                        <p>Merging existing segments.</p>
                     </li>
                 </ol>
 
@@ -135,8 +150,8 @@
                 </p>
             </section>
 
-            <section id="Document Numbers">
-                <title>Document Numbers</title>
+            <section id="Document Numbers"><title>Document Numbers</title>
+
                 <p>
                     Internally, Lucene refers to documents by an integer <i>document
                         number</i>. The first document added to an index is numbered zero, and each
@@ -149,7 +164,7 @@
 
                 <p>
                     Note that a document's number may change, so caution should be taken
-                    when storing these numbers outside of Lucene.  In particular, numbers may
+                    when storing these numbers outside of Lucene. In particular, numbers may
                     change in the following situations:
                 </p>
 
@@ -176,9 +191,9 @@
                     <li>
                         <p>
                             When documents are deleted, gaps are created
-                            in the numbering.  These are eventually removed as the index evolves
-                            through merging.  Deleted documents are dropped when segments are
-                            merged.  A freshly-merged segment thus has no gaps in its numbering.
+                            in the numbering. These are eventually removed as the index evolves
+                            through merging. Deleted documents are dropped when segments are
+                            merged. A freshly-merged segment thus has no gaps in its numbering.
                         </p>
                     </li>
                 </ul>
@@ -187,59 +202,68 @@
 
         </section>
 
-        <section id="Overview">
-            <title>Overview</title>
+        <section id="Overview"><title>Overview</title>
+
             <p>
                 Each segment index maintains the following:
             </p>
             <ul>
-                <li><p>Field names.  This
+                <li>
+                    <p>Field names. This
                         contains the set of field names used in the index.
 
                     </p>
                 </li>
-                <li><p>Stored Field
-                        values.  This contains, for each document, a list of attribute-value
-                        pairs, where the attributes are field names.  These are used to
+                <li>
+                    <p>Stored Field
+                        values. This contains, for each document, a list of attribute-value
+                        pairs, where the attributes are field names. These are used to
                         store auxiliary information about the document, such as its title,
                         url, or an identifier to access a
                         database. The set of stored fields are what is returned for each hit
-                        when searching.  This is keyed by document number.
+                        when searching. This is keyed by document number.
                     </p>
                 </li>
-                <li><p>Term dictionary.
+                <li>
+                    <p>Term dictionary.
                         A dictionary containing all of the terms used in all of the indexed
-                        fields of all of the documents.  The dictionary also contains the
+                        fields of all of the documents. The dictionary also contains the
                         number of documents which contain the term, and pointers to the
                         term's frequency and proximity data.
                     </p>
                 </li>
 
-                <li><p>Term Frequency
-                        data.  For each term in the dictionary, the numbers of all the
+                <li>
+                    <p>Term Frequency
+                        data. For each term in the dictionary, the numbers of all the
                         documents that contain that term, and the frequency of the term in
                         that document.
                     </p>
                 </li>
 
-                <li><p>Term Proximity
-                        data.  For each term in the dictionary, the positions that the term
+                <li>
+                    <p>Term Proximity
+                        data. For each term in the dictionary, the positions that the term
                         occurs in each document.
                     </p>
                 </li>
 
-                <li><p>Normalization
-                        factors.  For each field in each document, a value is stored that is
+                <li>
+                    <p>Normalization
+                        factors. For each field in each document, a value is stored that is
                         multiplied into the score for hits on that field.
                     </p>
                 </li>
-                <li><p>Term Vectors.  For each field in each document, the term vector
-                       (sometimes called document vector) may be stored.  A term vector consists
-                       of term text and term frequency.  To add Term Vectors to your index see the
-                    <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html">Field</a> constructors
+                <li>
+                    <p>Term Vectors. For each field in each document, the term vector
+                        (sometimes called document vector) may be stored. A term vector consists
+                        of term text and term frequency. To add Term Vectors to your index see the
+                        <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html">Field</a>
+                        constructors
                     </p>
-                </li>              
-                <li><p>Deleted documents.
+                </li>
+                <li>
+                    <p>Deleted documents.
                         An optional file indicating which documents are deleted.
                     </p>
                 </li>
@@ -249,11 +273,11 @@
             </p>
         </section>
 
-        <section id="File Naming">
-            <title>File Naming</title>
+        <section id="File Naming"><title>File Naming</title>
+
             <p>
                 All files belonging to a segment have the same name with varying
-                extensions.  The extensions correspond to the different file formats
+                extensions. The extensions correspond to the different file formats
                 described below. When using the Compound File format (default in 1.4 and greater) these files are
                 collapsed into a single .cfs file (see below for details)
             </p>
@@ -264,23 +288,35 @@
                 required.
             </p>
 
+            <p>
+                As of version 2.1 (lock-less commits), file names are
+                never re-used (there is one exception, "segments.gen",
+                see below). That is, when any file is saved to the
+                Directory it is given a never before used filename.
+                This is achieved using a simple generations approach.
+                For example, the first segments file is segments_1,
+                then segments_2, etc. The generation is a sequential
+                long integer represented in alpha-numeric (base 36)
+                form.
+            </p>
+
         </section>
 
-        <section id="Primitive Types">
-            <title>Primitive Types</title>
-            <section id="Byte">
-                <title>Byte</title>
+        <section id="Primitive Types"><title>Primitive Types</title>
+
+            <section id="Byte"><title>Byte</title>
+
                 <p>
                     The most primitive type
-                    is an eight-bit byte.  Files are accessed as sequences of bytes.  All
+                    is an eight-bit byte. Files are accessed as sequences of bytes. All
                     other data types are defined as sequences
                     of bytes, so file formats are byte-order independent.
                 </p>
 
             </section>
 
-            <section id="UInt32">
-                <title>UInt32</title>
+            <section id="UInt32"><title>UInt32</title>
+
                 <p>
                     32-bit unsigned integers are written as four
                     bytes, high-order bytes first.
@@ -291,8 +327,8 @@
 
             </section>
 
-            <section id="Uint64">
-                <title>Uint64</title>
+            <section id="Uint64"><title>Uint64</title>
+
                 <p>
                     64-bit unsigned integers are written as eight
                     bytes, high-order bytes first.
@@ -303,39 +339,45 @@
 
             </section>
 
-            <section id="VInt">
-                <title>VInt</title>
+            <section id="VInt"><title>VInt</title>
+
                 <p>
                     A variable-length format for positive integers is
                     defined where the high-order bit of each byte indicates whether more
-                    bytes remain to be read.  The low-order seven bits are appended as
+                    bytes remain to be read. The low-order seven bits are appended as
                     increasingly more significant bits in the resulting integer value.
                     Thus values from zero to 127 may be stored in a single byte, values
                     from 128 to 16,383 may be stored in two bytes, and so on.
                 </p>
 
-                <p><b>VInt Encoding Example</b></p>
+                <p>
+                    <b>VInt Encoding Example</b>
+                </p>
 
                 <table width="100%" border="0" cellpadding="4" cellspacing="0">
-                    <col width="64*" />
-                    <col width="64*" />
-                    <col width="64*" />
-                    <col width="64*" />
+                    <col width="64*"/>
+                    <col width="64*"/>
+                    <col width="64*"/>
+                    <col width="64*"/>
                     <tr valign="TOP">
                         <td width="25%">
-                            <p align="RIGHT"><b>Value</b>
+                            <p align="RIGHT">
+                                <b>Value</b>
                             </p>
                         </td>
                         <td width="25%">
-                            <p align="RIGHT"><b>First byte</b>
+                            <p align="RIGHT">
+                                <b>First byte</b>
                             </p>
                         </td>
                         <td width="25%">
-                            <p align="RIGHT"><b>Second byte</b>
+                            <p align="RIGHT">
+                                <b>Second byte</b>
                             </p>
                         </td>
                         <td width="25%">
-                            <p align="RIGHT"><b>Third byte</b>
+                            <p align="RIGHT">
+                                <b>Third byte</b>
                             </p>
                         </td>
                     </tr>
@@ -352,13 +394,15 @@
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
@@ -376,13 +420,15 @@
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
@@ -400,13 +446,15 @@
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
@@ -418,19 +466,22 @@
                         </td>
                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: 0.11cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
@@ -448,13 +499,15 @@
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
@@ -478,7 +531,8 @@
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
@@ -502,7 +556,8 @@
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
@@ -526,7 +581,8 @@
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
@@ -538,19 +594,22 @@
                         </td>
                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: 0.11cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.07cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
                         <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
@@ -574,7 +633,8 @@
                         </td>
                         <td width="25%" sdnum="1033;0;00000000">
                             <p align="RIGHT" style="margin-left: -0.47cm; margin-right:
-                               0.01cm"><br/>
+                               0.01cm">
+                                <br/>
 
                             </p>
                         </td>
@@ -663,20 +723,21 @@
 
             </section>
 
-            <section id="Chars">
-                <title>Chars</title>
+            <section id="Chars"><title>Chars</title>
+
                 <p>
                     Lucene writes unicode
                     character sequences using Java's
                     <a href="http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8">"modified
-                    UTF-8 encoding"</a>.
+                        UTF-8 encoding"</a>
+                    .
                 </p>
 
 
             </section>
 
-            <section id="String">
-                <title>String</title>
+            <section id="String"><title>String</title>
+
                 <p>
                     Lucene writes strings as a VInt representing the length, followed by
                     the character data.
@@ -690,40 +751,83 @@
 
         </section>
 
-        <section id="Per-Index Files">
-            <title>Per-Index Files</title>
+        <section id="Per-Index Files"><title>Per-Index Files</title>
+
             <p>
                 The files in this section exist one-per-index.
             </p>
 
-            <section id="Segments File">
-                <title>Segments File</title>
+            <section id="Segments File"><title>Segments File</title>
+
                 <p>
                     The active segments in the index are stored in the
-                    segment info file.  An index only has
-                    a single file in this format, and it is named "segments".
-                    This lists each segment by name, and also contains the size of each
-                    segment.
+                    segment info file,
+                    <tt>segments_N</tt>
+                    . There may
+                    be one or more
+                    <tt>segments_N</tt>
+                    files in the
+                    index; however, the one with the largest
+                    generation is the active one (when older
+                    segments_N files are present it's because they
+                    temporarily cannot be deleted, or, a writer is in
+                    the process of committing). This file lists each
+                    segment by name, has details about the separate
+                    norms and deletion files, and also contains the
+                    size of each segment.
+                </p>
+
+                <p>
+                    As of 2.1, there is also a file
+                    <tt>segments.gen</tt>
+                    . This file contains the
+                    current generation (the
+                    <tt>_N</tt>
+                    in
+                    <tt>segments_N</tt>
+                    ) of the index. This is
+                    used only as a fallback in case the current
+                    generation cannot be accurately determined by
+                    directory listing alone (as is the case for some
+                    NFS clients with time-based directory cache
+                    expiration). This file simply contains an Int32
+                    version header (SegmentInfos.FORMAT_LOCKLESS =
+                    -2), followed by the generation recorded as Int64,
+                    written twice.
                 </p>
 
                 <p>
-                    Segments    --&gt; Format, Version, NameCounter, SegCount, &lt;SegName, SegSize&gt;<sup>SegCount</sup>
+                    <b>Pre-2.1:</b>
+                    Segments --&gt; Format, Version, NameCounter, SegCount, &lt;SegName, SegSize&gt;
+                    <sup>SegCount</sup>
+                </p>
+                <p>
+                    <b>2.1 and above:</b>
+                    Segments --&gt; Format, Version, NameCounter, SegCount, &lt;SegName, SegSize, DelGen, NumField, NormGen
+                    <sup>NumField</sup>
+                    &gt;
+                    <sup>SegCount</sup>
+                    , IsCompoundFile
                 </p>
 
                 <p>
-                    Format, NameCounter, SegCount, SegSize    --&gt; UInt32
+                    Format, NameCounter, SegCount, SegSize, NumField --&gt; Int32
                 </p>
 
                 <p>
-                    Version --&gt; UInt64
+                    Version, DelGen, NormGen --&gt; Int64
                 </p>
 
                 <p>
-                    SegName    --&gt; String
+                    SegName --&gt; String
                 </p>
 
                 <p>
-                    Format is -1 in Lucene 1.4.
+                    IsCompoundFile --&gt; Int8
+                </p>
+
+                <p>
+                    Format is -1 as of Lucene 1.4 and -2 as of Lucene 2.1.
                 </p>
 
                 <p>
@@ -744,97 +848,118 @@
                     SegSize is the number of documents contained in the segment index.
                 </p>
 
+                <p>
+                    DelGen is the generation count of the separate
+                    deletes file. If this is -1, there are no
+                    separate deletes. If it is 0, this is a pre-2.1
+                    segment and you must check filesystem for the
+                    existence of _X.del. Anything above zero means
+                    there are separate deletes (_X_N.del).
+                </p>
 
-            </section>
+                <p>
+                    NumField is the size of the array for NormGen, or
+                    -1 if there are no NormGens stored.
+                </p>
 
-            <section id="Lock Files">
-                <title>Lock Files</title>
                 <p>
-                    Several files are used to indicate that another
-                    process is using an index.  Note that these files are not
-                    stored in the index directory itself, but rather in the
-                    system's temporary directory, as indicated in the Java
-                    system property "java.io.tmpdir".
+                    NormGen records the generation of the separate
+                    norms files. If NumField is -1, there are no
+                    normGens stored and they are all assumed to be 0
+                    when the segment file was written pre-2.1 and all
+                    assumed to be -1 when the segments file is 2.1 or
+                    above. The generation then has the same meaning
+                    as delGen (above).
+                </p>
+
+                <p>
+                    IsCompoundFile records whether the segment is
+                    written as a compound file or not. If this is -1,
+                    the segment is not a compound file. If it is 1,
+                    the segment is a compound file. Else it is 0,
+                    which means we check filesystem to see if _X.cfs
+                    exists.
                 </p>
 
-                <ul>
-                    <li>
-                        <p>
-                            When a file named "commit.lock"
-                            is present, a process is currently re-writing the "segments"
-                            file and deleting outdated segment index files, or a process is
-                            reading the "segments"
-                            file and opening the files of the segments it names.  This lock file
-                            prevents files from being deleted by another process after a process
-                            has read the "segments"
-                            file but before it has managed to open all of the files of the
-                            segments named therein.
-                        </p>
-                    </li>
 
-                    <li>
-                        <p>
-                            When a file named "write.lock"
-                            is present, a process is currently adding documents to an index, or
-                            removing files from that index.  This lock file prevents several
-                            processes from attempting to modify an index at the same time.
-                        </p>
-                    </li>
-                </ul>
             </section>
 
-            <section id="Deletable File">
-                <title>Deletable File</title>
+            <section id="Lock File"><title>Lock File</title>
+
                 <p>
-                    A file named "deletable"
-                    contains the names of files that are no longer used by the index, but
-                    which could not be deleted.  This is only used on Win32, where a
-                    file may not be deleted while it is still open. On other platforms
-                    the file contains only null bytes.
+                    A write lock is used to indicate that another
+                    process is writing to the index. Note that this file is not
+                    stored in the index directory itself, but rather in the
+                    system's temporary directory, as indicated in the Java
+                    system property "java.io.tmpdir".
                 </p>
 
                 <p>
-                    Deletable    --&gt; DeletableCount,
-                    &lt;DelableName&gt;<sup>DeletableCount</sup>
+                    The write lock is named "XXXX-write.lock" where
+                    XXXX is typically a unique prefix computed from the
+                    directory path to the index. When this file is
+                    present, a process is currently adding documents
+                    to an index, or removing files from that index.
+                    This lock file prevents several processes from
+                    attempting to modify an index at the same time.
                 </p>
 
-                <p>DeletableCount    --&gt; UInt32
+                <p>
+                    Note that prior to version 2.1, Lucene also used a
+                    commit lock. This was removed in 2.1.
                 </p>
-                <p>DeletableName    --&gt;
-                    String
+
+            </section>
+
+            <section id="Deletable File"><title>Deletable File</title>
+
+                <p>
+                    Prior to Lucene 2.1 there was a file "deletable"
+                    that contained details about files that need to be
+                    deleted. As of 2.1, a writer instead computes
+                    the deletable files dynamically, so no such file
+                    is written.
                 </p>
+
             </section>
 
-            <section id="Compound Files">
-                <title>Compound Files</title>
-            	<p>Starting with Lucene 1.4 the compound file format became default. This
-            	is simply a container for all files described in the next section.</p>
-            	
-            	<p>Compound (.cfs) --&gt; FileCount, &lt;DataOffset, FileName&gt;<sup>FileCount</sup>,
-            		FileData<sup>FileCount</sup></p>
-            	
-            	<p>FileCount --&gt; VInt</p>
-            	
-            	<p>DataOffset --&gt; Long</p>
+            <section id="Compound Files"><title>Compound Files</title>
+
+                <p>Starting with Lucene 1.4 the compound file format became default. This
+                    is simply a container for all files described in the next section.</p>
 
-            	<p>FileName --&gt; String</p>
+                <p>Compound (.cfs) --&gt; FileCount, &lt;DataOffset, FileName&gt;
+                    <sup>FileCount</sup>
+                    ,
+                    FileData
+                    <sup>FileCount</sup>
+                </p>
+
+                <p>FileCount --&gt; VInt</p>
+
+                <p>DataOffset --&gt; Long</p>
 
-            	<p>FileData --&gt; raw file data</p>
+                <p>FileName --&gt; String</p>
+
+                <p>FileData --&gt; raw file data</p>
                 <p>The raw file data is the data from the individual files named above.</p>
-            	
+
             </section>
 
         </section>
 
-        <section id="Per-Segment Files">
-            <title>Per-Segment Files</title>
+        <section id="Per-Segment Files"><title>Per-Segment Files</title>
+
             <p>
                 The remaining files are all per-segment, and are
                 thus defined by suffix.
             </p>
-            <section id="Fields">
-                <title>Fields</title>
-                <p><br/><b>Field Info</b><br/></p>
+            <section id="Fields"><title>Fields</title>
+                <p>
+                    <br/>
+                    <b>Field Info</b>
+                    <br/>
+                </p>
 
                 <p>
                     Field names are
@@ -842,48 +967,55 @@
                 </p>
                 <p>
                     FieldInfos
-                    (.fnm)    --&gt; FieldsCount, &lt;FieldName,
-                    FieldBits&gt;<sup>FieldsCount</sup>
+                    (.fnm) --&gt; FieldsCount, &lt;FieldName,
+                    FieldBits&gt;
+                    <sup>FieldsCount</sup>
                 </p>
 
                 <p>
-                    FieldsCount    --&gt; VInt
+                    FieldsCount --&gt; VInt
                 </p>
 
                 <p>
-                    FieldName    --&gt; String
+                    FieldName --&gt; String
                 </p>
 
                 <p>
-                    FieldBits    --&gt; Byte
+                    FieldBits --&gt; Byte
                 </p>
 
                 <p>
-	          <ul>
-                    <li>
-                    The low-order bit is one for
-		    indexed fields, and zero for non-indexed fields.
-                    </li>
-		    <li>
-		    The second lowest-order
-                    bit is one for fields that have term vectors stored, and zero for fields
-                    without term vectors.  
-	            </li>
-                        <p><b>Lucene &gt;= 1.9:</b></p>
-		    <li> If the third lowest-order bit is set (0x04), term positions are stored with the term vectors. </li>
-		    <li> If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors. </li>
-		    <li> If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field. </li>
-		  </ul>
+                    <ul>
+                        <li>
+                            The low-order bit is one for
+                            indexed fields, and zero for non-indexed fields.
+                        </li>
+                        <li>
+                            The second lowest-order
+                            bit is one for fields that have term vectors stored, and zero for fields
+                            without term vectors.
+                        </li>
+                        <p>
+                            <b>Lucene &gt;= 1.9:</b>
+                        </p>
+                        <li>If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.</li>
+                        <li>If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.</li>
+                        <li>If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field.</li>
+                    </ul>
                 </p>
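The FieldBits flags listed above can be tested with plain bit masks. The following is a minimal Java sketch, not Lucene code: the class name, constant names, and method name are hypothetical, and the masks 0x01 and 0x02 are inferred from the phrases "low-order bit" and "second lowest-order bit".

```java
// Hypothetical helper for reading the FieldBits byte of a .fnm entry.
// Names are illustrative only and are not part of the Lucene API.
final class FieldBitsSketch {
    static final int IS_INDEXED = 0x01;         // low-order bit
    static final int STORE_TERM_VECTOR = 0x02;  // second lowest-order bit
    static final int STORE_POSITIONS = 0x04;    // Lucene >= 1.9
    static final int STORE_OFFSETS = 0x08;      // Lucene >= 1.9
    static final int OMIT_NORMS = 0x10;         // Lucene >= 1.9

    // Returns true when the flag selected by mask is set in fieldBits.
    static boolean isSet(byte fieldBits, int mask) {
        return (fieldBits & mask) != 0;
    }
}
```

For example, a FieldBits value of 0x13 would describe an indexed field with term vectors stored and norms omitted, but without term positions or offsets in the term vectors.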
 
                 <p>
-                    Fields are numbered by their order in this file.  Thus field zero is
+                    Fields are numbered by their order in this file. Thus field zero is
                     the
-                    first field in the file, field one the next, and so on.  Note that,
+                    first field in the file, field one the next, and so on. Note that,
                     like document numbers, field numbers are segment relative.
                 </p>
 
-                <p><br/><b>Stored Fields</b><br/></p>
+                <p>
+                    <br/>
+                    <b>Stored Fields</b>
+                    <br/>
+                </p>
 
                 <p>
                     Stored fields are represented by two files:
@@ -902,17 +1034,24 @@
 
                         <p>
                             FieldIndex
-                            (.fdx)    --&gt;
-                            &lt;FieldValuesPosition&gt;<sup>SegSize</sup>
+                            (.fdx) --&gt;
+                            &lt;FieldValuesPosition&gt;
+                            <sup>SegSize</sup>
                         </p>
                         <p>FieldValuesPosition
                             --&gt; Uint64
                         </p>
                         <p>This
                             is used to find the location within the field data file of the
-                            fields of a particular document.  Because it contains fixed-length
-                            data, this file may be easily randomly accessed.  The position of
-                            document<i> n</i>'s<i> </i>field data is the Uint64 at <i>n*8</i> in
+                            fields of a particular document. Because it contains fixed-length
+                            data, this file may be easily randomly accessed. The position of
+                            document
+                            <i>n</i>
+                            's
+                            <i></i>
+                            field data is the Uint64 at
+                            <i>n*8</i>
+                            in
                             this file.
                         </p>
                     </li>
@@ -928,48 +1067,54 @@
                         </p>
 
                         <p>
-                            FieldData (.fdt)    --&gt;
-                            &lt;DocFieldData&gt;<sup>SegSize</sup>
+                            FieldData (.fdt) --&gt;
+                            &lt;DocFieldData&gt;
+                            <sup>SegSize</sup>
+                        </p>
+                        <p>DocFieldData --&gt;
+                            FieldCount, &lt;FieldNum, Bits, Value&gt;
+                            <sup>FieldCount</sup>
                         </p>
-                        <p>DocFieldData    --&gt;
-                            FieldCount, &lt;FieldNum, Bits, Value&gt;<sup>FieldCount</sup>
-                        </p>
-                        <p>FieldCount  --&gt;
+                        <p>FieldCount --&gt;
                             VInt
                         </p>
-                        <p>FieldNum    --&gt;
+                        <p>FieldNum --&gt;
                             VInt
                         </p>
-                        
-                        <p><b>Lucene &lt;= 1.4:</b></p>
-                        <p>Bits        --&gt;
+
+                        <p>
+                            <b>Lucene &lt;= 1.4:</b>
+                        </p>
+                        <p>Bits --&gt;
                             Byte
                         </p>
-                        <p>Value        --&gt;
+                        <p>Value --&gt;
                             String
                         </p>
-                        <p>Only the low-order bit of Bits is used.  It is one for
+                        <p>Only the low-order bit of Bits is used. It is one for
                             tokenized fields, and zero for non-tokenized fields.
                         </p>
-                        <p><b>Lucene &gt;= 1.9:</b></p>
-                        <p>Bits        --&gt;
+                        <p>
+                            <b>Lucene &gt;= 1.9:</b>
+                        </p>
+                        <p>Bits --&gt;
                             Byte
                         </p>
                         <p>
-                        <ul>
-                        	<li>low order bit is one for tokenized fields</li>
-                        	<li>second bit is one for fields containing binary data</li>
-                        	<li>third bit is one for fields with compression option enabled
-                        		(if compression is enabled, the algorithm used is ZLIB)</li>
-                        </ul>
+                            <ul>
+                                <li>low order bit is one for tokenized fields</li>
+                                <li>second bit is one for fields containing binary data</li>
+                                <li>third bit is one for fields with compression option enabled
+                                    (if compression is enabled, the algorithm used is ZLIB)</li>
+                            </ul>
                         </p>
-                        <p>Value        --&gt;
+                        <p>Value --&gt;
                             String | BinaryValue (depending on Bits)
                         </p>
-                        <p>BinaryValue        --&gt;
+                        <p>BinaryValue --&gt;
                             ValueSize, &lt;Byte&gt;^ValueSize
                         </p>
-                        <p>ValueSize        --&gt;
+                        <p>ValueSize --&gt;
                             VInt
                         </p>
 
@@ -977,8 +1122,8 @@
                 </ol>
 
             </section>
-            <section id="Term Dictionary">
-                <title>Term Dictionary</title>
+            <section id="Term Dictionary"><title>Term Dictionary</title>
+
                 <p>
                     The term dictionary is represented as two files:
                 </p>
@@ -992,35 +1137,38 @@
                             TermInfoFile (.tis)--&gt;
                             TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
                         </p>
-                        <p>TIVersion    --&gt;
+                        <p>TIVersion --&gt;
                             UInt32
                         </p>
-                        <p>TermCount    --&gt;
+                        <p>TermCount --&gt;
                             UInt64
                         </p>
-                        <p>IndexInterval    --&gt;
+                        <p>IndexInterval --&gt;
                             UInt32
                         </p>
-                        <p>SkipInterval   --&gt;
+                        <p>SkipInterval --&gt;
                             UInt32
                         </p>
-                        <p>TermInfos    --&gt;
-                            &lt;TermInfo&gt;<sup>TermCount</sup>
+                        <p>TermInfos --&gt;
+                            &lt;TermInfo&gt;
+                            <sup>TermCount</sup>
                         </p>
-                        <p>TermInfo    --&gt;
+                        <p>TermInfo --&gt;
                             &lt;Term, DocFreq, FreqDelta, ProxDelta, SkipDelta&gt;
                         </p>
-                        <p>Term        --&gt;
+                        <p>Term --&gt;
                             &lt;PrefixLength, Suffix, FieldNum&gt;
                         </p>
-                        <p>Suffix        --&gt;
+                        <p>Suffix --&gt;
                             String
                         </p>
                         <p>PrefixLength,
-                            DocFreq, FreqDelta, ProxDelta, SkipDelta<br/>        --&gt; VInt
+                            DocFreq, FreqDelta, ProxDelta, SkipDelta
+                            <br/>
+                            --&gt; VInt
                         </p>
                         <p>This
-                            file is sorted by Term.  Terms are ordered first lexicographically
+                            file is sorted by Term. Terms are ordered first lexicographically
                             by the term's field name, and within that lexicographically by the
                             term's text.
                         </p>
@@ -1028,9 +1176,9 @@
                             of this file and is -2 in Lucene 1.4.
                         </p>
                         <p>Term
-                            text prefixes are shared.  The PrefixLength is the number of initial
+                            text prefixes are shared. The PrefixLength is the number of initial
                             characters from the previous term which must be pre-pended to a
-                            term's suffix in order to form the term's text.  Thus, if the
+                            term's suffix in order to form the term's text. Thus, if the
                             previous term's text was "bone" and the term is "boy",
                             the PrefixLength is two and the suffix is "y".
                         </p>
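The prefix sharing described above can be sketched in a few lines of Java. This is an illustration under stated assumptions rather than Lucene's implementation: the class and method names are hypothetical, and terms are handled as Java strings instead of being read from the .tis file.

```java
// Hypothetical sketch of shared-prefix term coding; not the Lucene API.
final class TermPrefixSketch {
    // Number of leading characters that current shares with previous.
    static int prefixLength(String previous, String current) {
        int limit = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < limit && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    // The part of current that has to be written after the shared prefix.
    static String suffix(String previous, String current) {
        return current.substring(prefixLength(previous, current));
    }

    // Rebuilds a term's text from the previous term, PrefixLength and Suffix.
    static String rebuild(String previous, int prefixLength, String suffix) {
        return previous.substring(0, prefixLength) + suffix;
    }
}
```

With the previous term "bone" and the term "boy", prefixLength gives two, suffix gives "y", and rebuild("bone", 2, "y") restores "boy".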
@@ -1042,18 +1190,18 @@
                         </p>
                         <p>FreqDelta
                             determines the position of this term's TermFreqs within the .frq
-                            file.  In particular, it is the difference between the position of
+                            file. In particular, it is the difference between the position of
                             this term's data in that file and the position of the previous
                             term's data (or zero, for the first term in the file).
                         </p>
                         <p>ProxDelta
                             determines the position of this term's TermPositions within the .prx
-                            file.  In particular, it is the difference between the position of
+                            file. In particular, it is the difference between the position of
                             this term's data in that file and the position of the previous
                             term's data (or zero, for the first term in the file).
                         </p>
                         <p>SkipDelta determines the position of this
-                            term's SkipData within the .frq file.  In
+                            term's SkipData within the .frq file. In
                             particular, it is the number of bytes
                             after TermFreqs that the SkipData starts.
                             In other words, it is the length of the
@@ -1066,8 +1214,10 @@
                         </p>
 
                         <p>
-                            This contains every IndexInterval<sup>th</sup> entry from the .tis
-                            file, along with its location in the &quot;tis&quot; file.  This is
+                            This contains every IndexInterval
+                            <sup>th</sup>
+                            entry from the .tis
+                            file, along with its location in the &quot;tis&quot; file. This is
                             designed to be read entirely into memory and used to provide random
                             access to the &quot;tis&quot; file.
                         </p>
@@ -1079,28 +1229,29 @@
 
                         <p>
                             TermInfoIndex (.tii)--&gt;
-                            TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices 
+                            TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices
                         </p>
                         <p>TIVersion --&gt;
-                        	UInt32
+                            UInt32
                         </p>
-                        <p>IndexTermCount    --&gt;
+                        <p>IndexTermCount --&gt;
                             UInt64
                         </p>
                         <p>IndexInterval --&gt;
-                        	UInt32
+                            UInt32
                         </p>
                         <p>SkipInterval --&gt;
-                        	UInt32
+                            UInt32
                         </p>
-                        <p>TermIndices    --&gt;
-                            &lt;TermInfo, IndexDelta&gt;<sup>IndexTermCount</sup>
+                        <p>TermIndices --&gt;
+                            &lt;TermInfo, IndexDelta&gt;
+                            <sup>IndexTermCount</sup>
                         </p>
-                        <p>IndexDelta    --&gt;
+                        <p>IndexDelta --&gt;
                             VLong
                         </p>
                         <p>IndexDelta
-                            determines the position of this term's TermInfo within the .tis file.  In
+                            determines the position of this term's TermInfo within the .tis file. In
                             particular, it is the difference between the position of this term's
                             entry in that file and the position of the previous term's entry.
                         </p>
@@ -1112,29 +1263,32 @@
                 </ol>
             </section>
 
-            <section id="Frequencies">
-                <title>Frequencies</title>
+            <section id="Frequencies"><title>Frequencies</title>
+
                 <p>
                     The .frq file contains the lists of documents
                     which contain each term, along with the frequency of the term in that
                     document.
                 </p>
-                <p>FreqFile (.frq)    --&gt;
-                    &lt;TermFreqs, SkipData&gt;<sup>TermCount</sup>
+                <p>FreqFile (.frq) --&gt;
+                    &lt;TermFreqs, SkipData&gt;
+                    <sup>TermCount</sup>
+                </p>
+                <p>TermFreqs --&gt;
+                    &lt;TermFreq&gt;
+                    <sup>DocFreq</sup>
                 </p>
-                <p>TermFreqs    --&gt;
-                    &lt;TermFreq&gt;<sup>DocFreq</sup>
-                </p>
-                <p>TermFreq        --&gt;
+                <p>TermFreq --&gt;
                     DocDelta, Freq?
                 </p>
-                <p>SkipData        --&gt;
-                    &lt;SkipDatum&gt;<sup>DocFreq/SkipInterval</sup>
+                <p>SkipData --&gt;
+                    &lt;SkipDatum&gt;
+                    <sup>DocFreq/SkipInterval</sup>
                 </p>
-                <p>SkipDatum    --&gt;
+                <p>SkipDatum --&gt;
                     DocSkip,FreqSkip,ProxSkip
                 </p>
-                <p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip    --&gt;
+                <p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip --&gt;
                     VInt
                 </p>
                 <p>TermFreqs
@@ -1144,61 +1298,79 @@
                     entries are ordered by increasing document number.
                 </p>
                 <p>DocDelta
-                    determines both the document number and the frequency.  In
+                    determines both the document number and the frequency. In
                     particular, DocDelta/2 is the difference between this document number
                     and the previous document number (or zero when this is the first
-                    document in a TermFreqs).  When DocDelta is odd, the frequency is
-                    one.  When DocDelta is even, the frequency is read as another VInt.
+                    document in a TermFreqs). When DocDelta is odd, the frequency is
+                    one. When DocDelta is even, the frequency is read as another VInt.
                 </p>
                 <p>For
                     example, the TermFreqs for a term which occurs once in document seven
                     and three times in document eleven would be the following sequence of
                     VInts:
                 </p>
-                <p>    15,
+                <p>15,
                     8, 3
                 </p>
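The DocDelta scheme above can be sketched as follows. This is a minimal Java illustration, not Lucene's code: the class and method names are hypothetical, and the sketch works on the decoded VInt values as plain integers rather than on the variable-length bytes stored in the .frq file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the TermFreqs encoding; not the Lucene API.
final class TermFreqsSketch {
    // Encodes (document number, frequency) pairs, ordered by increasing
    // document number, into the sequence of VInt values described above.
    static List<Integer> encode(int[] docs, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int previousDoc = 0;
        for (int i = 0; i < docs.length; i++) {
            int delta = docs[i] - previousDoc;
            previousDoc = docs[i];
            if (freqs[i] == 1) {
                out.add(delta * 2 + 1);   // odd value: frequency is one
            } else {
                out.add(delta * 2);       // even value: frequency follows
                out.add(freqs[i]);
            }
        }
        return out;
    }

    // Decodes the VInt values back into {document number, frequency} pairs.
    static List<int[]> decode(List<Integer> vints) {
        List<int[]> out = new ArrayList<>();
        int doc = 0;
        int i = 0;
        while (i < vints.size()) {
            int code = vints.get(i++);
            doc += code >>> 1;            // DocDelta/2 is the document gap
            int freq = (code & 1) != 0 ? 1 : vints.get(i++);
            out.add(new int[] {doc, freq});
        }
        return out;
    }
}
```

Encoding documents {7, 11} with frequencies {1, 3} yields the values 15, 8, 3 from the example: 7*2+1 = 15 for the first document, then (11-7)*2 = 8 followed by the frequency 3.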
                 <p>DocSkip records the document number before every
-                    SkipInterval<sup>th</sup> document in TermFreqs.
+                    SkipInterval
+                    <sup>th</sup>
+                    document in TermFreqs.
                     Document numbers are represented as differences
-                    from the previous value in the sequence.  FreqSkip
+                    from the previous value in the sequence. FreqSkip
                     and ProxSkip record the position of every
-                    SkipInterval<sup>th</sup> entry in FreqFile and
-                    ProxFile, respectively.  File positions are
+                    SkipInterval
+                    <sup>th</sup>
+                    entry in FreqFile and
+                    ProxFile, respectively. File positions are
                     relative to the start of TermFreqs and Positions,
                     to the previous SkipDatum in the sequence.
                 </p>
                 <p>For example, if DocFreq=35 and SkipInterval=16,
                     then there are two SkipData entries, containing
-                    the 15<sup>th</sup> and 31<sup>st</sup> document
-                    numbers in TermFreqs.  The first FreqSkip names
+                    the 15
+                    <sup>th</sup>
+                    and 31
+                    <sup>st</sup>
+                    document
+                    numbers in TermFreqs. The first FreqSkip names
                     the number of bytes after the beginning of
-                    TermFreqs that the 16<sup>th</sup> SkipDatum
+                    TermFreqs that the 16
+                    <sup>th</sup>
+                    SkipDatum
                     starts, and the second the number of bytes after
-                    that that the 32<sup>nd</sup> starts.  The first
+                    that that the 32
+                    <sup>nd</sup>
+                    starts. The first
                     ProxSkip names the number of bytes after the
-                    beginning of Positions that the 16<sup>th</sup>
+                    beginning of Positions that the 16
+                    <sup>th</sup>
                     SkipDatum starts, and the second the number of
-                    bytes after that that the 32<sup>nd</sup> starts.
+                    bytes after that that the 32
+                    <sup>nd</sup>
+                    starts.
                 </p>
 
             </section>
-            <section id="Positions">
-                <title>Positions</title>
+            <section id="Positions"><title>Positions</title>
+
                 <p>
                     The .prx file contains the lists of positions that
                     each term occurs at within documents.
                 </p>
-                <p>ProxFile (.prx)    --&gt;
-                    &lt;TermPositions&gt;<sup>TermCount</sup>
+                <p>ProxFile (.prx) --&gt;
+                    &lt;TermPositions&gt;
+                    <sup>TermCount</sup>
+                </p>
+                <p>TermPositions --&gt;
+                    &lt;Positions&gt;
+                    <sup>DocFreq</sup>
+                </p>
+                <p>Positions --&gt;
+                    &lt;PositionDelta&gt;
+                    <sup>Freq</sup>
                 </p>
-                <p>TermPositions    --&gt;
-                    &lt;Positions&gt;<sup>DocFreq</sup>
-                </p>
-                <p>Positions        --&gt;
-                    &lt;PositionDelta&gt;<sup>Freq</sup>
-                </p>
-                <p>PositionDelta    --&gt;
+                <p>PositionDelta --&gt;
                     VInt
                 </p>
                 <p>TermPositions
@@ -1219,129 +1391,144 @@
                     fifth and ninth term in a subsequent document, would be the following
                     sequence of VInts:
                 </p>
-                <p>    4,
+                <p>4,
                     5, 4
                 </p>
             </section>
-            <section id="Normalization Factors">
-                <title>Normalization Factors</title>
+            <section id="Normalization Factors"><title>Normalization Factors</title>
                 <p>There's a norm file for each indexed field with a byte for
-                   each document.  The .f[0-9]* file contains,
+                    each document. The .f[0-9]* file contains,
                     for each document, a byte that encodes a value that is multiplied
                     into the score for hits on that field:
                 </p>
                 <p>Norms
-                    (.f[0-9]*)    --&gt; &lt;Byte&gt;<sup>SegSize</sup>
+                    (.f[0-9]*) --&gt; &lt;Byte&gt;
+                    <sup>SegSize</sup>
                 </p>
                 <p>Each
-                    byte encodes a floating point value.  Bits 0-2 contain the 3-bit
+                    byte encodes a floating point value. Bits 0-2 contain the 3-bit
                     mantissa, and bits 3-7 contain the 5-bit exponent.
                 </p>
                 <p>These
                     are converted to an IEEE single float value as follows:
                 </p>
                 <ol>
-                    <li><p>If
+                    <li>
+                        <p>If
                             the byte is zero, use a zero float.
                         </p>
                     </li>
-                    <li><p>Otherwise,
+                    <li>
+                        <p>Otherwise,
                             set the sign bit of the float to zero;
                         </p>
                     </li>
-                    <li><p>add
+                    <li>
+                        <p>add
                             48 to the exponent and use this as the float's exponent;
                         </p>
                     </li>
-                    <li><p>map
+                    <li>
+                        <p>map
                             the mantissa to the high-order 3 bits of the float's mantissa; and
 
                         </p>
                     </li>
-                    <li><p>set
+                    <li>
+                        <p>set
                             the low-order 21 bits of the float's mantissa to zero.
                         </p>
                     </li>
                 </ol>
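The conversion steps above leave the exact shift amounts open. The following Java sketch shows one reading of them and is not taken from the Lucene sources: the class and method names are hypothetical, and the shifts (biased exponent placed at bit 24, 3-bit mantissa at bit 21) are an assumption chosen so that the low-order 21 bits of the result are zero, as the last step requires.

```java
// Hypothetical sketch of norm byte decoding; the shift amounts are an
// assumption, not something the format description states explicitly.
final class NormByteSketch {
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;                  // step 1: a zero byte is a zero float
        }
        int mantissa = b & 0x07;          // bits 0-2 of the byte
        int exponent = (b >> 3) & 0x1F;   // remaining five bits of the byte
        // The sign bit stays zero because exponent + 48 is at most 79.
        int bits = ((exponent + 48) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }
}
```

Under this reading the byte value 124 (exponent 15, mantissa 4) produces the bit pattern 0x3F800000, which is the float 1.0.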
 
             </section>
-            <section id="Term Vectors">
-                <title>Term Vectors</title>
-              Term Vector support is an optional on a field by field basis.  It consists of 4
-              files.
-              <ol>
-                <li>
-                  <p>The Document Index or .tvx file.</p>
-                  <p>This contains, for each document, a pointer to the document data in the Document 
-                    (.tvd) file.
-                  </p>
-                  <p>DocumentIndex (.tvx) --&gt; TVXVersion&lt;DocumentPosition&gt;<sup>NumDocs</sup></p>
-                  <p>TVXVersion --&gt; Int</p>
-                  <p>DocumentPosition   --&gt; UInt64</p>
-                  <p>This is used to find the position of the Document in the .tvd file.</p>
-                </li>
-                <li>
-                  <p>The Document or .tvd file.</p>
-                  <p>This contains, for each document, the number of fields, a list of the fields with
-                  term vector info and finally a list of pointers to the field information in the .tvf 
-                  (Term Vector Fields) file.</p>
-                  <p>
-                    Document (.tvd) --&gt; TVDVersion&lt;NumFields, FieldNums, FieldPositions,&gt;<sup>NumDocs</sup>
-                  </p>
-                  <p>TVDVersion --&gt; Int</p>
-                  <p>NumFields --&gt; VInt</p>
-                  <p>FieldNums --&gt; &lt;FieldNumDelta&gt;<sup>NumFields</sup></p>
-                  <p>FieldNumDelta --&gt; VInt</p>
-                  <p>FieldPositions --&gt; &lt;FieldPosition&gt;<sup>NumFields</sup></p>
-                  <p>FieldPosition --&gt; VLong</p>
-                  <p>The .tvd file is used to map out the fields that have term vectors stored and
-                  where the field information is in the .tvf file.</p>
-                </li>
-                <li>
-                  <p>The Field or .tvf file.</p>
-                  <p>This file contains, for each field that has a term vector stored, a list of
-                  the terms and their frequencies.</p>
-                  <p>Field (.tvf) --&gt; TVFVersion&lt;NumTerms, NumDistinct, TermFreqs&gt;<sup>NumFields</sup></p>
-                  <p>TVFVersion --&gt; Int</p>
-                  <p>NumTerms --&gt; VInt</p>
-                  <p>NumDistinct --&gt; VInt -- Future Use</p>
-                  <p>TermFreqs --&gt; &lt;TermText, TermFreq&gt;<sup>NumTerms</sup></p>
-                  <p>TermText --&gt; &lt;PrefixLength, Suffix&gt;</p>
-                  <p>PrefixLength --&gt; VInt</p>
-                  <p>Suffix --&gt; String</p>
-                  <p>TermFreq --&gt; VInt</p>
-                  <p>Term
-                      text prefixes are shared.  The PrefixLength is the number of initial
-                      characters from the previous term which must be pre-pended to a
-                      term's suffix in order to form the term's text.  Thus, if the
-                      previous term's text was "bone" and the term is "boy",
-                      the PrefixLength is two and the suffix is "y".
-                  </p>
-                </li>
-              </ol>
+            <section id="Term Vectors"><title>Term Vectors</title>
+                Term Vector support is optional on a field-by-field basis. It consists of 3
+                files.
+                <ol>
+                    <li>
+                        <p>The Document Index or .tvx file.</p>
+                        <p>This contains, for each document, a pointer to the document data in the Document
+                            (.tvd) file.
+                        </p>
+                        <p>DocumentIndex (.tvx) --&gt; TVXVersion&lt;DocumentPosition&gt;
+                            <sup>NumDocs</sup>
+                        </p>
+                        <p>TVXVersion --&gt; Int</p>
+                        <p>DocumentPosition --&gt; UInt64</p>
+                        <p>This is used to find the position of the Document in the .tvd file.</p>
+                    </li>
+                    <li>
+                        <p>The Document or .tvd file.</p>
+                        <p>This contains, for each document, the number of fields, a list of the fields with
+                            term vector info and finally a list of pointers to the field information in the .tvf
+                            (Term Vector Fields) file.</p>
+                        <p>
+                            Document (.tvd) --&gt; TVDVersion&lt;NumFields, FieldNums, FieldPositions&gt;
+                            <sup>NumDocs</sup>
+                        </p>
+                        <p>TVDVersion --&gt; Int</p>
+                        <p>NumFields --&gt; VInt</p>
+                        <p>FieldNums --&gt; &lt;FieldNumDelta&gt;
+                            <sup>NumFields</sup>
+                        </p>
+                        <p>FieldNumDelta --&gt; VInt</p>
+                        <p>FieldPositions --&gt; &lt;FieldPosition&gt;
+                            <sup>NumFields</sup>
+                        </p>
+                        <p>FieldPosition --&gt; VLong</p>
+                        <p>The .tvd file is used to map out the fields that have term vectors stored and
+                            where the field information is in the .tvf file.</p>
+                    </li>
+                    <li>
+                        <p>The Field or .tvf file.</p>
+                        <p>This file contains, for each field that has a term vector stored, a list of
+                            the terms and their frequencies.</p>
+                        <p>Field (.tvf) --&gt; TVFVersion&lt;NumTerms, NumDistinct, TermFreqs&gt;
+                            <sup>NumFields</sup>
+                        </p>
+                        <p>TVFVersion --&gt; Int</p>
+                        <p>NumTerms --&gt; VInt</p>
+                        <p>NumDistinct --&gt; VInt -- Future Use</p>
+                        <p>TermFreqs --&gt; &lt;TermText, TermFreq&gt;
+                            <sup>NumTerms</sup>
+                        </p>
+                        <p>TermText --&gt; &lt;PrefixLength, Suffix&gt;</p>
+                        <p>PrefixLength --&gt; VInt</p>
+                        <p>Suffix --&gt; String</p>
+                        <p>TermFreq --&gt; VInt</p>
+                        <p>Term
+                            text prefixes are shared. The PrefixLength is the number of initial
+                            characters from the previous term which must be prepended to a
+                            term's suffix in order to form the term's text. Thus, if the
+                            previous term's text was "bone" and the term is "boy",
+                            the PrefixLength is two and the suffix is "y".
+                        </p>
+                    </li>
+                </ol>
             </section>
 
-            <section id="Deleted Documents">
-                <title>Deleted Documents</title>
+            <section id="Deleted Documents"><title>Deleted Documents</title>
 
                 <p>The .del file is
                     optional, and only exists when a segment contains deletions:
                 </p>
 
                 <p>Deletions
-                    (.del)    --&gt; ByteCount,BitCount,Bits
+                    (.del) --&gt; ByteCount,BitCount,Bits
                 </p>
 
-                <p>ByteSize,BitCount    --&gt;
+                <p>ByteCount,BitCount --&gt;
                     UInt32
                 </p>
 
-                <p>Bits        --&gt;
-                    &lt;Byte&gt;<sup>ByteCount</sup>
+                <p>Bits --&gt;
+                    &lt;Byte&gt;
+                    <sup>ByteCount</sup>
                 </p>
 
                 <p>ByteCount
-                    indicates the number of bytes in Bits.  It is typically
+                    indicates the number of bytes in Bits. It is typically
                     (SegSize/8)+1.
                 </p>
 
@@ -1351,22 +1538,22 @@
                 </p>
 
                 <p>Bits
-                    contains one bit for each document indexed.  When the bit
+                    contains one bit for each document indexed. When the bit
                     corresponding to a document number is set, that document is marked as
-                    deleted.  Bit ordering is from least to most significant.  Thus, if
+                    deleted. Bit ordering is from least to most significant. Thus, if
                     Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as
                     deleted.
                 </p>
             </section>
         </section>
 
-        <section id="Limitations">
-            <title>Limitations</title>
+        <section id="Limitations"><title>Limitations</title>
+
             <p>There
                 are a few places where these file formats limit the maximum number of
                 terms and documents to a 32-bit quantity, or to approximately 4
-                billion.  This is not today a problem, but, in the long term,
-                probably will be.  These should therefore be replaced with either
+                billion. This is not a problem today but, in the long term,
+                probably will be. These should therefore be replaced with either
                 UInt64 values, or better yet, with VInt values which have no limit.
             </p>
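
Note on the TermText rule in the .tvf description above: the prefix sharing can be sketched in Java as follows. This is a minimal illustration only, not Lucene's reader or writer code. The class and method names (TermPrefixSketch, decodeTerm, sharedPrefixLength) are hypothetical, and the sketch assumes PrefixLength counts Java chars.

```java
// Hypothetical helper, not part of Lucene: illustrates the PrefixLength/Suffix rule.
final class TermPrefixSketch {
    private TermPrefixSketch() {}

    /** Rebuilds a term's text from the previous term, a PrefixLength, and a Suffix. */
    static String decodeTerm(String previousTerm, int prefixLength, String suffix) {
        if (prefixLength < 0 || prefixLength > previousTerm.length()) {
            throw new IllegalArgumentException("bad prefix length: " + prefixLength);
        }
        // Keep the shared leading characters, then append the stored suffix.
        return previousTerm.substring(0, prefixLength) + suffix;
    }

    /** Counts the leading characters two consecutive terms have in common. */
    static int sharedPrefixLength(String previousTerm, String term) {
        int limit = Math.min(previousTerm.length(), term.length());
        int i = 0;
        while (i < limit && previousTerm.charAt(i) == term.charAt(i)) {
            i++;
        }
        return i;
    }
}
```

With the example from the text, decodeTerm("bone", 2, "y") rebuilds "boy", and sharedPrefixLength("bone", "boy") gives the PrefixLength of two.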
 
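
Note on the Bits layout in the .del description above: the "least to most significant" bit ordering can be sketched in Java as follows. This is an illustration under stated assumptions, not Lucene's own bit-vector code. The names DeletedBitsSketch, isDeleted and typicalByteCount are hypothetical, and the byte array is assumed to hold only the Bits portion of the file, without the ByteCount and BitCount headers.

```java
// Hypothetical helper, not part of Lucene: illustrates the .del bit ordering.
final class DeletedBitsSketch {
    private DeletedBitsSketch() {}

    /** Returns true if the bit for docNumber is set, i.e. the document is marked deleted. */
    static boolean isDeleted(byte[] bits, int docNumber) {
        int b = bits[docNumber >> 3] & 0xFF;   // byte holding this document's bit
        int mask = 1 << (docNumber & 7);       // bit 0 is the least significant bit
        return (b & mask) != 0;
    }

    /** The typical ByteCount quoted in the text: (SegSize/8)+1, using integer division. */
    static int typicalByteCount(int segSize) {
        return segSize / 8 + 1;
    }
}
```

For the two bytes 0x00 and 0x02 from the text, isDeleted reports document 9 as deleted and every other document from 0 to 15 as live.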


