lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From cutt...@apache.org
Subject cvs commit: jakarta-lucene/xdocs fileformats.xml
Date Mon, 29 Mar 2004 22:30:40 GMT
cutting     2004/03/29 14:30:40

  Modified:    docs     fileformats.html whoweare.html
               xdocs    fileformats.xml
  Log:
  Updated file format documentation to note skip data.
  
  Revision  Changes    Path
  1.22      +48 -9     jakarta-lucene/docs/fileformats.html
  
  Index: fileformats.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/fileformats.html,v
  retrieving revision 1.21
  retrieving revision 1.22
  diff -u -r1.21 -r1.22
  --- fileformats.html	29 Mar 2004 12:46:36 -0000	1.21
  +++ fileformats.html	29 Mar 2004 22:30:40 -0000	1.22
  @@ -1332,9 +1332,18 @@
   
                           <p>
                               TermInfoFile (.tis)--&gt;
  -                            TermCount, TermInfos
  +                            TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
  +                        </p>
  +                        <p>TIVersion    --&gt;
  +                            UInt32
                           </p>
                           <p>TermCount    --&gt;
  +                            UInt64
  +                        </p>
  +                        <p>IndexInterval    --&gt;
  +                            UInt32
  +                        </p>
  +                        <p>SkipInterval   --&gt;
                               UInt32
                           </p>
                           <p>TermInfos    --&gt;
  @@ -1357,6 +1366,9 @@
                               by the term's field name, and within that lexicographically
by the
                               term's text.
                           </p>
  +                        <p>TIVersion names the version of the format
  +                            of this file and is -1 in Lucene 1.4.
  +                        </p>
                           <p>Term
                               text prefixes are shared.  The PrefixLength is the number of
initial
                               characters from the previous term which must be pre-pended
to a
  @@ -1389,7 +1401,7 @@
                           </p>
   
                           <p>
  -                            This contains every 128th entry from the .tis
  +                            This contains every IndexInterval<sup>th</sup>
entry from the .tis
                               file, along with its location in the "tis" file.  This is
                               designed to be read entirely into memory and used to provide
random
                               access to the "tis" file.
  @@ -1440,6 +1452,7 @@
                   </p>
                                                   <p>FreqFile (.frq)    --&gt;
                       &lt;TermFreqs&gt;<sup>TermCount</sup>
  +                    &lt;SkipDatum&gt;<sup>TermCount/SkipInterval</sup>
                   </p>
                                                   <p>TermFreqs    --&gt;
                       &lt;TermFreq&gt;<sup>DocFreq</sup>
  @@ -1447,7 +1460,10 @@
                                                   <p>TermFreq        --&gt;
                       DocDelta, Freq?
                   </p>
  -                                                <p>DocDelta,Freq    --&gt;
  +                                                <p>SkipDatum        --&gt;
  +                    DocSkip,FreqSkip,ProxSkip
  +                </p>
  +                                                <p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip
   --&gt;
                       VInt
                   </p>
                                                   <p>TermFreqs
  @@ -1471,6 +1487,29 @@
                                                   <p>    15,
                       22, 3
                   </p>
  +                                                <p>DocSkip records the document number
before every
  +                    SkipInterval<sup>th</sup> document in TermFreqs.
  +                    Document numbers are represented as differences
  +                    from the previous value in the sequence.  FreqSkip
  +                    and ProxSkip record the position of every
  +                    SkipInterval<sup>th</sup> entry in FreqFile and
  +                    ProxFile, respectively.  File positions are
  +                    relative to the start of TermFreqs and Positions,
  +                    to the previous SkipDatum in the sequence.
  +                </p>
  +                                                <p>For example, if TermCount=35 and
SkipInterval=16,
  +                    then there are two SkipData entries, containing
  +                    the 15<sup>th</sup> and 31<sup>st</sup> document
  +                    numbers in TermFreqs.  The first FreqSkip names
  +                    the number of bytes after the beginning of
  +                    TermFreqs that the 16<sup>th</sup> SkipDatum
  +                    starts, and the second the number of bytes after
  +                    that that the 32<sup>nd</sup> starts.  The first
  +                    ProxSkip names the number of bytes after the
  +                    beginning of Positions that the 16<sup>th</sup>
  +                    SkipDatum starts, and the second the number of
  +                    bytes after that that the 32<sup>nd</sup> starts.
  +                </p>
                               </blockquote>
         </td></tr>
         <tr><td><br/></td></tr>
  @@ -1588,8 +1627,8 @@
                     <p>This contains, for each document, a pointer to the document
data in the Document 
                       (.tvd) file.
                     </p>
  -                  <p>DocumentIndex (.tvx) --&gt; FormatVersion&lt;DocumentPosition&gt;<sup>NumDocs</sup></p>
  -                  <p>FormatVersion --&gt; Int</p>
  +                  <p>DocumentIndex (.tvx) --&gt; TVXVersion&lt;DocumentPosition&gt;<sup>NumDocs</sup></p>
  +                  <p>TVXVersion --&gt; Int</p>
                     <p>DocumentPosition   --&gt; UInt64</p>
                     <p>This is used to find the position of the Document in the .tvd
file.</p>
                   </li>
  @@ -1599,9 +1638,9 @@
                     term vector info and finally a list of pointers to the field information
in the .tvf 
                     (Term Vector Fields) file.</p>
                     <p>
  -                    Document (.tvd) --&gt; FormatVersion&lt;NumFields, FieldNums,
FieldPositions,&gt;<sup>NumDocs</sup>
  +                    Document (.tvd) --&gt; TVDVersion&lt;NumFields, FieldNums,
FieldPositions,&gt;<sup>NumDocs</sup>
                     </p>
  -                  <p>FormatVersion --&gt; Int</p>
  +                  <p>TVDVersion --&gt; Int</p>
                     <p>NumFields --&gt; VInt</p>
                     <p>FieldNums --&gt; &lt;FieldNumDelta&gt;<sup>NumFields</sup></p>
                     <p>FieldNumDelta --&gt; VInt</p>
  @@ -1614,8 +1653,8 @@
                     <p>The Field or .tvf file.</p>
                     <p>This file contains, for each field that has a term vector stored,
a list of
                     the terms and their frequencies.</p>
  -                  <p>Field (.tvf) --&gt; FormatVersion&lt;NumTerms, NumDistinct,
TermFreqs&gt;<sup>NumFields</sup></p>
  -                  <p>FormatVersion --&gt; Int</p>
  +                  <p>Field (.tvf) --&gt; TVFVersion&lt;NumTerms, NumDistinct,
TermFreqs&gt;<sup>NumFields</sup></p>
  +                  <p>TVFVersion --&gt; Int</p>
                     <p>NumTerms --&gt; VInt</p>
                     <p>NumDistinct --&gt; VInt -- Future Use</p>
                     <p>TermFreqs --&gt; &lt;TermText, TermFreq&gt;<sup>NumTerms</sup></p>
  
  
  
  1.38      +1 -1      jakarta-lucene/docs/whoweare.html
  
  Index: whoweare.html
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/docs/whoweare.html,v
  retrieving revision 1.37
  retrieving revision 1.38
  diff -u -r1.37 -r1.38
  --- whoweare.html	25 Mar 2004 13:24:22 -0000	1.37
  +++ whoweare.html	29 Mar 2004 22:30:40 -0000	1.38
  @@ -167,7 +167,7 @@
   limited contract work.</p>
   
   </li>
  -<li><b>Otis Gospodneti&#263;</b> (otis at apache.org)</li>
  +<li><b>Otis Gospodneti?</b> (otis at apache.org)</li>
   <li><b>Brian Goetz</b> (briangoetz at apache.org)</li>
   <li><b>Scott Ganyo</b> (scottganyo at apache.org)</li>
   <li><b>Eugene Gluzberg</b> (drag0n at apache.org)</li>
  
  
  
  1.9       +49 -9     jakarta-lucene/xdocs/fileformats.xml
  
  Index: fileformats.xml
  ===================================================================
  RCS file: /home/cvs/jakarta-lucene/xdocs/fileformats.xml,v
  retrieving revision 1.8
  retrieving revision 1.9
  diff -u -r1.8 -r1.9
  --- fileformats.xml	29 Mar 2004 12:46:36 -0000	1.8
  +++ fileformats.xml	29 Mar 2004 22:30:40 -0000	1.9
  @@ -905,9 +905,18 @@
   
                           <p>
                               TermInfoFile (.tis)--&gt;
  -                            TermCount, TermInfos
  +                            TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
  +                        </p>
  +                        <p>TIVersion    --&gt;
  +                            UInt32
                           </p>
                           <p>TermCount    --&gt;
  +                            UInt64
  +                        </p>
  +                        <p>IndexInterval    --&gt;
  +                            UInt32
  +                        </p>
  +                        <p>SkipInterval   --&gt;
                               UInt32
                           </p>
                           <p>TermInfos    --&gt;
  @@ -930,6 +939,9 @@
                               by the term's field name, and within that lexicographically
by the
                               term's text.
                           </p>
  +                        <p>TIVersion names the version of the format
  +                            of this file and is -1 in Lucene 1.4.
  +                        </p>
                           <p>Term
                               text prefixes are shared.  The PrefixLength is the number of
initial
                               characters from the previous term which must be pre-pended
to a
  @@ -962,7 +974,7 @@
                           </p>
   
                           <p>
  -                            This contains every 128th entry from the .tis
  +                            This contains every IndexInterval<sup>th</sup>
entry from the .tis
                               file, along with its location in the &quot;tis&quot;
file.  This is
                               designed to be read entirely into memory and used to provide
random
                               access to the &quot;tis&quot; file.
  @@ -1005,6 +1017,7 @@
                   </p>
                   <p>FreqFile (.frq)    --&gt;
                       &lt;TermFreqs&gt;<sup>TermCount</sup>
  +                    &lt;SkipDatum&gt;<sup>TermCount/SkipInterval</sup>
                   </p>
                   <p>TermFreqs    --&gt;
                       &lt;TermFreq&gt;<sup>DocFreq</sup>
  @@ -1012,7 +1025,10 @@
                   <p>TermFreq        --&gt;
                       DocDelta, Freq?
                   </p>
  -                <p>DocDelta,Freq    --&gt;
  +                <p>SkipDatum        --&gt;
  +                    DocSkip,FreqSkip,ProxSkip
  +                </p>
  +                <p>DocDelta,Freq,DocSkip,FreqSkip,ProxSkip    --&gt;
                       VInt
                   </p>
                   <p>TermFreqs
  @@ -1036,6 +1052,30 @@
                   <p>    15,
                       22, 3
                   </p>
  +                <p>DocSkip records the document number before every
  +                    SkipInterval<sup>th</sup> document in TermFreqs.
  +                    Document numbers are represented as differences
  +                    from the previous value in the sequence.  FreqSkip
  +                    and ProxSkip record the position of every
  +                    SkipInterval<sup>th</sup> entry in FreqFile and
  +                    ProxFile, respectively.  File positions are
  +                    relative to the start of TermFreqs and Positions,
  +                    to the previous SkipDatum in the sequence.
  +                </p>
  +                <p>For example, if TermCount=35 and SkipInterval=16,
  +                    then there are two SkipData entries, containing
  +                    the 15<sup>th</sup> and 31<sup>st</sup> document
  +                    numbers in TermFreqs.  The first FreqSkip names
  +                    the number of bytes after the beginning of
  +                    TermFreqs that the 16<sup>th</sup> SkipDatum
  +                    starts, and the second the number of bytes after
  +                    that that the 32<sup>nd</sup> starts.  The first
  +                    ProxSkip names the number of bytes after the
  +                    beginning of Positions that the 16<sup>th</sup>
  +                    SkipDatum starts, and the second the number of
  +                    bytes after that that the 32<sup>nd</sup> starts.
  +                </p>
  +
               </subsection>
               <subsection name="Positions">
   
  @@ -1127,8 +1167,8 @@
                     <p>This contains, for each document, a pointer to the document
data in the Document 
                       (.tvd) file.
                     </p>
  -                  <p>DocumentIndex (.tvx) --&gt; FormatVersion&lt;DocumentPosition&gt;<sup>NumDocs</sup></p>
  -                  <p>FormatVersion --&gt; Int</p>
  +                  <p>DocumentIndex (.tvx) --&gt; TVXVersion&lt;DocumentPosition&gt;<sup>NumDocs</sup></p>
  +                  <p>TVXVersion --&gt; Int</p>
                     <p>DocumentPosition   --&gt; UInt64</p>
                     <p>This is used to find the position of the Document in the .tvd
file.</p>
                   </li>
  @@ -1138,9 +1178,9 @@
                     term vector info and finally a list of pointers to the field information
in the .tvf 
                     (Term Vector Fields) file.</p>
                     <p>
  -                    Document (.tvd) --&gt; FormatVersion&lt;NumFields, FieldNums,
FieldPositions,&gt;<sup>NumDocs</sup>
  +                    Document (.tvd) --&gt; TVDVersion&lt;NumFields, FieldNums,
FieldPositions,&gt;<sup>NumDocs</sup>
                     </p>
  -                  <p>FormatVersion --&gt; Int</p>
  +                  <p>TVDVersion --&gt; Int</p>
                     <p>NumFields --&gt; VInt</p>
                     <p>FieldNums --&gt; &lt;FieldNumDelta&gt;<sup>NumFields</sup></p>
                     <p>FieldNumDelta --&gt; VInt</p>
  @@ -1153,8 +1193,8 @@
                     <p>The Field or .tvf file.</p>
                     <p>This file contains, for each field that has a term vector stored,
a list of
                     the terms and their frequencies.</p>
  -                  <p>Field (.tvf) --&gt; FormatVersion&lt;NumTerms, NumDistinct,
TermFreqs&gt;<sup>NumFields</sup></p>
  -                  <p>FormatVersion --&gt; Int</p>
  +                  <p>Field (.tvf) --&gt; TVFVersion&lt;NumTerms, NumDistinct,
TermFreqs&gt;<sup>NumFields</sup></p>
  +                  <p>TVFVersion --&gt; Int</p>
                     <p>NumTerms --&gt; VInt</p>
                     <p>NumDistinct --&gt; VInt -- Future Use</p>
                     <p>TermFreqs --&gt; &lt;TermText, TermFreq&gt;<sup>NumTerms</sup></p>
  
  
  

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Mime
View raw message