lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 18927] - [PATCH] Term Vector support
Date Mon, 09 Feb 2004 16:16:30 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=18927

[PATCH] Term Vector support

grant_ingersoll@yahoo.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|Store                       |Website



------- Additional Comments From grant_ingersoll@yahoo.com  2004-02-09 16:16 -------
Below is the diff for the File Formats XML file (xdocs/fileformats.xml), as
promised.  I trust it will be checked for accuracy.  Let me know if there are
any mistakes and I will fix them.

cvs diff -Nu fileformats.xml

Index: fileformats.xml
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/xdocs/fileformats.xml,v
retrieving revision 1.6
diff -u -r1.6 fileformats.xml
--- fileformats.xml	13 Oct 2003 13:53:08 -0000	1.6
+++ fileformats.xml	9 Feb 2004 16:08:57 -0000
@@ -224,7 +224,11 @@
                         multiplied into the score for hits on that field.
                     </p>
                 </li>
-
+                <li><p>Term Vectors.  For each field in each document, the term vector
+                       (sometimes called document vector) is stored.  A term vector consists
+                       of the term text, term frequency and term position.
+                    </p>
+                </li>              
                 <li><p>Deleted documents.
                         An optional file indicating which documents are deleted.
                     </p>
@@ -804,9 +808,10 @@
                 </p>
 
                 <p>
-                    Currently only the low-order bit is used of FieldBits is used.  It is
-                    one for
-                    indexed fields, and zero for non-indexed fields.
+                    The low-order bit is one for
+                    indexed fields, and zero for non-indexed fields.  The second lowest-order
+                    bit is one for fields that have term vectors stored, and zero for fields
+                    without term vectors.
                 </p>
 
                 <p>
@@ -1112,6 +1117,57 @@
                     </li>
                 </ol>
 
+            </subsection>
+            <subsection name="Term Vectors">
+              Term Vector support is optional on a field-by-field basis.  It consists of 4
+              files.
+              <ol>
+                <li>
+                  <p>The Document Index or .tvx file.</p>
+                  <p>This contains, for each document, a pointer to the document data in the Document
+                    (.tvd) file.
+                  </p>
+                  <p>DocumentIndex (.tvx) --&gt; &lt;DocumentPosition&gt;<sup>NumDocs</sup></p>
+                  <p>DocumentPosition   --&gt; UInt64</p>
+                  <p>This is used to find the position of the Document in the .tvd file.</p>
+                </li>
+                <li>
+                  <p>The Document or .tvd file.</p>
+                  <p>This contains, for each document, the number of fields, a list of the fields with
+                  term vector info and finally a list of pointers to the field information in the .tvf
+                  (Term Vector Fields) file.</p>
+                  <p>
+                    Document (.tvd) --&gt; &lt;NumFields, FieldNums, FieldPositions&gt;<sup>NumDocs</sup>
+                  </p>
+                  <p>NumFields --&gt; VInt</p>
+                  <p>FieldNums --&gt; &lt;FieldNumDelta&gt;<sup>NumFields</sup></p>
+                  <p>FieldNumDelta --&gt; VInt</p>
+                  <p>FieldPositions --&gt; &lt;FieldPosition&gt;<sup>NumFields</sup></p>
+                  <p>FieldPosition --&gt; VLong</p>
+                  <p>The .tvd file is used to map out the fields that have term vectors stored and
+                  where the field information is in the .tvf file.</p>
+                </li>
+                <li>
+                  <p>The Field or .tvf file.</p>
+                  <p>This file contains, for each field that has a term vector stored, a list of
+                  the terms and their frequencies.</p>
+                  <p>Field (.tvf) --&gt; &lt;NumTerms, NumDistinct, TermFreqs, TermPositionPointerDelta&gt;<sup>NumFields</sup></p>
+                  <p>NumTerms --&gt; VInt</p>
+                  <p>NumDistinct --&gt; VInt -- Future Use</p>
+                  <p>TermFreqs --&gt; &lt;TermText, TermFreq&gt;<sup>NumTerms</sup></p>
+                  <p>TermText --&gt; String</p>
+                  <p>TermFreq --&gt; VInt</p>
+                  <p>TermPositionPointerDelta --&gt; VLong</p>
+                  <p></p>
+                </li>
+                <li>
+                  <p>The Positions or .tvp file.</p>
+                  <p>This contains, for each term in each field of each document, the positional information for
+                  that term.</p>
+                  <p>Positions (.tvp) --&gt; &lt;PositionDelta&gt;<sup>NumPositions</sup></p>
+                  <p>PositionDelta --&gt; VInt</p>
+                </li>
+              </ol>
             </subsection>
 
             <subsection name="Deleted Documents">
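
To make the FieldBits change in the second hunk concrete, here is a minimal Python sketch of how a reader might decode the two flags. The constant and function names are hypothetical and are not taken from the Lucene source; only the bit assignments come from the patched text.

```python
# Hypothetical names for illustration; only the bit meanings come from the patch.
INDEXED_BIT = 0x01      # low-order bit: one for indexed fields
TERM_VECTOR_BIT = 0x02  # second lowest-order bit: one if term vectors are stored


def decode_field_bits(field_bits):
    """Return the two flags carried by a FieldBits byte as booleans."""
    return {
        "indexed": bool(field_bits & INDEXED_BIT),
        "term_vector": bool(field_bits & TERM_VECTOR_BIT),
    }
```

Under this reading, a FieldBits value of 3 would describe an indexed field with term vectors stored, and a value of 1 an indexed field without them.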

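The .tvx and .tvd grammars in the last hunk can be exercised with a short Python sketch. It assumes the usual Lucene primitive encodings (UInt64 as eight bytes, high-order byte first; VInt and VLong as 7 data bits per byte, low-order group first, high bit set on every byte but the last). It also assumes that FieldNumDelta values accumulate into field numbers, which the patch implies by the name but does not state. All function names are hypothetical and this is not the Lucene reader code.

```python
import io
import struct


def read_vint(stream):
    """Read one VInt/VLong: 7 data bits per byte, low-order group first."""
    result, shift = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated VInt")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear marks the last byte
            return result
        shift += 7


def read_tvx(data, num_docs):
    """DocumentIndex (.tvx) --> <DocumentPosition>^NumDocs, each a UInt64."""
    return list(struct.unpack(">%dQ" % num_docs, data[: 8 * num_docs]))


def read_tvd_entry(data, position):
    """Read <NumFields, FieldNums, FieldPositions> for one document in .tvd."""
    stream = io.BytesIO(data)
    stream.seek(position)
    num_fields = read_vint(stream)
    field_nums, current = [], 0
    for _ in range(num_fields):
        current += read_vint(stream)  # assumption: deltas accumulate
        field_nums.append(current)
    # FieldPosition is given as a plain VLong in the grammar, so no summing here.
    field_positions = [read_vint(stream) for _ in range(num_fields)]
    return field_nums, field_positions
```

A reader would take the DocumentPosition for a document from read_tvx and pass it to read_tvd_entry to learn which fields have term vectors and where each one starts in the .tvf file.
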
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

