Date: Wed, 21 Jul 2010 10:52:16 +0200
Subject: Structure of .tii-file
From: Alexander vom Berg
To: java-user@lucene.apache.org

Hello everybody,

I am reading the file format documentation and checking it against an index I created. The documentation says:

TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices

If I look into the .tii file I see the following:

TIVersion      = FF FF FF FC                    (4 bytes)
IndexTermCount = 00 00 00 00 00 00 00 0C = 12   (8 bytes)
IndexInterval  = 00 00 00 80             = 128  (4 bytes)
SkipInterval   = 00 00 00 10             = 16   (4 bytes)
MaxSkipLevels  = 00 00 00 0A             = 10   (4 bytes)
TermIndices    = ?????                          (? bytes)

I looked at two indexes, and in both of them the bytes that follow start with the same sequence (marked bold):

*00 00 FF FF FF FF 0F 00 00 00 18 00* (0B 61 or 0D 30 .....)

Maybe I don't understand the map <TermInfo, IndexDelta>^IndexTermCount. How should I calculate the correct byte length? I assume that IndexDelta, being a VLong, takes 8 bytes if the leading bit is 0 (similar to VInt, or is VLong described somewhere?). TermInfo is explained in the .tis file section.

TermIndices = <TermInfo, IndexDelta>
            = <(Term, DocFreq, FreqDelta, ProxDelta, SkipDelta), IndexDelta>
            = <([PrefixLength, Suffix, FieldNum], DocFreq, FreqDelta, ProxDelta, SkipDelta), IndexDelta>
            = <([ 00 , 00 , FF ], FF , FF , FF , 0F ), 00 00 00 18 00 0B 61 6E>

With this reading, IndexDelta is too large for my small index! DocFreq is also too large, because I only have 16 documents in total. :(

Can somebody tell me how to read the bytes correctly from the file? I would like to use the .tii data to find the correct position in the .tis file.

Best regards
Alex
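
For reference, the byte widths discussed above can be checked with a small script. This is only a minimal sketch in Python, not Lucene's own reader: the function names `read_header` and `read_vint` are made up for illustration. It assumes the five header fields are stored big-endian with the widths listed in the message, and that a VInt carries 7 data bits per byte, low-order group first, with the high bit set on every byte except the last, as the primitive types section of the file format documentation describes. Which field each variable-length value belongs to is left open here.

```python
import struct

# Hypothetical helper: decode the five fixed-width .tii header fields.
# ">" = big-endian without padding, "i" = signed 32-bit, "q" = signed 64-bit.
HEADER_FORMAT = ">iqiii"

def read_header(buf):
    """Return (fields, next_offset) for the fixed-width header at the start of buf."""
    version, term_count, index_interval, skip_interval, max_skip_levels = (
        struct.unpack_from(HEADER_FORMAT, buf, 0)
    )
    fields = {
        "TIVersion": version,
        "IndexTermCount": term_count,
        "IndexInterval": index_interval,
        "SkipInterval": skip_interval,
        "MaxSkipLevels": max_skip_levels,
    }
    return fields, struct.calcsize(HEADER_FORMAT)

# Hypothetical helper: decode one variable-length integer.
# Assumption: 7 payload bits per byte, least significant group first,
# high bit set means "another byte follows".
def read_vint(buf, pos):
    """Return (value, next_pos) for the variable-length integer at buf[pos]."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if (b & 0x80) == 0:
            return value, pos
        shift += 7

# Header bytes exactly as quoted in the message.
HEADER = bytes.fromhex("FFFFFFFC 000000000000000C 00000080 00000010 0000000A")
# The byte sequence that was identical in both indexes.
TAIL = bytes.fromhex("00 00 FF FF FF FF 0F 00 00 00 18 00")

if __name__ == "__main__":
    fields, offset = read_header(HEADER)
    print(fields)                 # IndexTermCount decodes to 12 (0x0C)
    print(offset)                 # 24 bytes of fixed-width header
    print(read_vint(TAIL, 0))     # (0, 1): a single 00 byte is a complete value
    print(read_vint(TAIL, 2))     # (4294967295, 7): FF FF FF FF 0F is one 5-byte value
```

Under these assumptions a variable-length value is not a fixed 8 bytes: each `00` byte is a complete value on its own, and `FF FF FF FF 0F` is consumed as a single value (0xFFFFFFFF) rather than as five separate fields.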