lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Structure of .tii-file
Date Wed, 21 Jul 2010 09:13:45 GMT
Best explanation is the source code itself -- it should be correct ;)

Look at how SegmentTermsEnum.next is implemented, pre-flex.  (If
you're looking at flex (= trunk), the format is slightly different
and not yet correctly documented; an issue is open for that.)

Yes, vInt and vLong use the same encoding, except a vLong can take up
to 9 bytes (a vInt at most 5).  Neither has a fixed length: we look at
the high bit of each byte, and keep reading/shifting bytes as long as
that's 1.  It's a rather CPU-unfriendly format, since that "if" is a
branch that's usually hard to predict.
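
As a rough illustration of that read loop, here is a minimal sketch.
VIntCursor is a made-up helper over bytes already loaded into memory,
not a Lucene class; it only assumes the rule above (low 7 bits of each
byte are payload, lowest-order group first, high bit means "more
bytes follow").

```java
/** Hypothetical cursor over bytes copied out of the file; not part of Lucene. */
final class VIntCursor {
    private final byte[] buf;
    private int pos;

    VIntCursor(byte[] buf) { this.buf = buf; }

    /** Number of bytes consumed so far. */
    int position() { return pos; }

    /** Reads a vInt: 1 to 5 bytes, 7 payload bits per byte. */
    int readVInt() {
        byte b = buf[pos++];
        int value = b & 0x7F;
        // Keep reading while the high bit of the byte just read is set.
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Same loop, accumulating into a long (up to 9 bytes for positive values). */
    long readVLong() {
        byte b = buf[pos++];
        long value = b & 0x7FL;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            value |= (b & 0x7FL) << shift;
        }
        return value;
    }
}
```

Feeding it the five bytes FF FF FF FF 0F gives -1 from readVInt, and
the two bytes 80 01 give 128.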

Each term is delta coded against the last term, ie we only write the
changed suffix.  First vInt is the suffix start (the length of the
prefix shared with the last term).  Next vInt is the suffix length.
Then come the suffix bytes (UTF8 pre-flex, opaque in flex).  In your
case these both look to be 0?  Ie, the first term is the empty string.
Next comes the field number as a vInt (pre-flex), but at that point
you have -1 (encodes as FF FF FF FF 0F in vInt), which is odd -- field
numbers should not be negative.  Must be missing something...

The deltas (FreqDelta, ProxDelta, IndexDelta) are then vLongs, each
delta coded against the previous entry.
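
Putting the pieces together, here is a hedged sketch of reading one
.tii entry pre-flex.  TiiEntrySketch is a made-up class, not Lucene
code.  The field order follows the documentation quoted in the
question below; that SkipDelta is only present when DocFreq >=
SkipInterval, and that lengths count UTF8 bytes, are my assumptions
from reading the file format docs, so check them against the source.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical reader for pre-flex .tii entries; a sketch, not Lucene's implementation. */
final class TiiEntrySketch {
    static final int SKIP_INTERVAL = 16; // SkipInterval from the header in the question

    private final byte[] buf; // bytes following the .tii header
    private int pos;

    byte[] termBytes = new byte[0]; // running term; each entry stores only a suffix
    int fieldNumber;
    int docFreq;
    int skipDelta;
    long freqPointer, proxPointer, tisPointer; // running sums of the deltas

    TiiEntrySketch(byte[] buf) { this.buf = buf; }

    int position() { return pos; }

    String termText() { return new String(termBytes, StandardCharsets.UTF_8); }

    private long readVLong() {
        byte b = buf[pos++];
        long value = b & 0x7FL;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            value |= (b & 0x7FL) << shift;
        }
        return value;
    }

    // Truncating the long keeps the low 32 bits, so FF FF FF FF 0F becomes -1.
    private int readVInt() { return (int) readVLong(); }

    void readEntry() {
        int prefixLength = readVInt(); // "suffix start"
        int suffixLength = readVInt();
        // Keep the shared prefix of the last term, then append the new suffix.
        byte[] next = Arrays.copyOf(termBytes, prefixLength + suffixLength);
        System.arraycopy(buf, pos, next, prefixLength, suffixLength);
        pos += suffixLength;
        termBytes = next;
        fieldNumber = readVInt();
        docFreq = readVInt();
        freqPointer += readVLong();
        proxPointer += readVLong();
        if (docFreq >= SKIP_INTERVAL) { // assumption: SkipDelta is conditional
            skipDelta = readVInt();
        }
        tisPointer += readVLong(); // IndexDelta
    }
}
```

Under those assumptions, the bytes 00 00 FF FF FF FF 0F 00 00 00 18
from the question decode as one entry: empty term, field number -1,
DocFreq 0, FreqDelta 0, ProxDelta 0, IndexDelta 24.  The header fields
listed in the question add up to 4 + 8 + 4 + 4 + 4 = 24 bytes, so if
the .tis header has the same layout, 24 would point just past it.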

Mike

On Wed, Jul 21, 2010 at 4:52 AM, Alexander vom Berg <mail@avomberg.de> wrote:
> Hello everybody,
>
> I am reading the file format documentation and checking it against an
> index I created.
> The documentation says:
> TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval,
> SkipInterval, MaxSkipLevels, TermIndices
>
> If I look into the .tii-file I see the following:
> TIVersion = FF FF FF FC  (4 Bytes)
> IndexTermCount = 00 00 00 00 00 00 00 0C = 10  (8 Bytes)
> IndexInterval = 00 00 00 80 = 128  (4 Bytes)
> SkipInterval = 00 00 00 10 = 16  (4 Bytes)
> MaxSkipLevels = 00 00 00 0A = 10 (4 Bytes)
> TermIndices = ?????  (? Bytes)
>
> I looked in two indexes and for both the following byte sequences are equal
> (marked bold):
> *00 00 FF FF FF FF 0F 00 00 00 18 00* (0B 61 or 0D30 .....)
>
> Maybe I don't understand the Map with *<TermInfo, IndexDelta>^IndexTermCount*.
> How should I calculate the correct byte length?
> I assume the IndexDelta with VLong has 8 bytes if the leading bit is 0
> (Similar to VInt or is VLong somewhere described?). TermInfo is explained in
> the .tis file section.
>
> TermIndices   = <TermInfo, IndexDelta>
>
> = <(Term,DocFreq,FreqDelta,ProxDelta,SkipDelta), IndexDelta>
> = <([PrefixLength,Suffix,FieldNum],DocFreq,FreqDelta,ProxDelta,SkipDelta),
>         IndexDelta>
> = <([00, 00, FF], FF, FF, FF, 0F), 00 00 00 18 00 0B 61 6E>
>
>
>
> IndexDelta is too large for my small index! Also DocFreq is too large because
> I only have 16 documents in total. :(
>
> Can somebody tell me how to read the bytes correctly from the file? I would
> like to find the correct position in the .tis file from .tii data.
>
> Best regards
> Alex
>
