lucene-java-user mailing list archives

From "Burton-West, Tom" <>
Subject termIndexInterval, CheckIndex, size of tis file and Lucene index compression
Date Mon, 21 Mar 2011 17:15:41 GMT
I'm trying to get a feel for the impact of changing the termIndexInterval from the default
of 128 to 1024 (8 * 128). This shrinks the tii file to about 1/8th of its size, but in the worst
case requires a linear scan of 1024 terms instead of 128 in memory. I'm not so concerned
about the performance impact of the in-memory scan, but I am trying to get an idea of
how this affects disk I/O, i.e. assuming a term is not in the tii file, we need to load up to 1024
terms from the tis file instead of 128.
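
To make the tradeoff concrete, here is a minimal sketch of a sparse term index: keep every Nth term in memory, binary search that list, then scan forward in the full term list. This is only an illustration in Python, not Lucene's actual code; the function names and the synthetic term list are made up for the example.

```python
import bisect

def build_term_index(sorted_terms, interval):
    """Keep every `interval`-th term, as a stand-in for the in-memory index."""
    return sorted_terms[::interval]

def lookup(sorted_terms, term_index, interval, target):
    """Return (position or None, number of terms scanned in the full list)."""
    # Binary search the sparse index for the last indexed term <= target.
    slot = bisect.bisect_right(term_index, target) - 1
    if slot < 0:
        return None, 0
    start = slot * interval
    scanned = 0
    # Linear scan of at most `interval` terms starting at the indexed term.
    for pos in range(start, min(start + interval, len(sorted_terms))):
        scanned += 1
        if sorted_terms[pos] == target:
            return pos, scanned
        if sorted_terms[pos] > target:
            break
    return None, scanned

# Synthetic, already-sorted term list (hypothetical data).
terms = [f"term{i:07d}" for i in range(100_000)]

for interval in (128, 1024):
    index = build_term_index(terms, interval)
    # Worst case: the last term before the next indexed term.
    pos, scanned = lookup(terms, index, interval, terms[interval - 1])
    print(f"interval={interval} index_entries={len(index)} scanned={scanned}")
```

The index gets about 8 times smaller at 1024, and the worst-case scan gets 8 times longer.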

I looked at the CheckIndex output for one of our very large segments to get the number
of terms in the segment (see below) and got about 2.7 billion terms. (We have lots of dirty
OCR from 400 languages.) The tis file is about 24.7 GB. I divided the size in bytes of the tis file
for that segment by the number of terms to get the average number of bytes/term:

(24.7 * 10^9 bytes) / (2.7 * 10^9 terms) = ~9 bytes/term.
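
The same arithmetic with the exact figures from the CheckIndex and ls output below, plus a rough estimate of how many tis bytes one index block spans at each interval. This is a back-of-the-envelope sketch: it assumes the average entry size applies uniformly, which real data may not satisfy, and the function names are made up for the example.

```python
TIS_BYTES = 24_775_378_328   # size of _2cj.tis from ls -l
TERM_COUNT = 2_723_440_775   # term count reported by CheckIndex

def avg_bytes_per_term(tis_bytes, term_count):
    """Average size of one term entry in the tis file."""
    return tis_bytes / term_count

def block_bytes(interval, bytes_per_term):
    """Rough size of the tis region between two consecutive indexed terms."""
    return interval * bytes_per_term

avg = avg_bytes_per_term(TIS_BYTES, TERM_COUNT)
print(f"average entry size: {avg:.2f} bytes/term")

for interval in (128, 1024):
    entries = TERM_COUNT // interval
    kb = block_bytes(interval, avg) / 1024
    print(f"interval={interval}: ~{entries:,} index entries, ~{kb:.1f} KB per block")
```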

This is the average size of a term entry in the tis file (assuming the CheckIndex and ls outputs
are correct).
This seems too small. Looking at the Lucene File formats doc (excerpt below), if we assume
that every field other than the Suffix of the term is a VInt that occupies only 1 byte,
we have 6 bytes for that data, which leaves only 3 bytes for the String that holds the Suffix.

What am I missing here?

Tom Burton-West


From the Lucene File formats doc:

TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt
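
For reference, a minimal sketch of how the pieces in this excerpt are sized. It assumes the VInt layout described in the file formats doc (7 data bits per byte, low-order bits first, high bit set when more bytes follow) and that a String is stored as a VInt length followed by its UTF-8 bytes. The helper names and the sample terms are made up for the example; this is not Lucene code.

```python
def encode_vint(value):
    """Encode a non-negative int as a VInt (7 bits per byte, low-order first)."""
    if value < 0:
        raise ValueError("VInt cannot encode negative values")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)

def prefix_suffix(prev_term, term):
    """Split `term` into (PrefixLength shared with prev_term, Suffix)."""
    shared = 0
    for a, b in zip(prev_term, term):
        if a != b:
            break
        shared += 1
    return shared, term[shared:]

def term_entry_size(prev_term, term, field_num):
    """Bytes for <PrefixLength, Suffix, FieldNum> under the assumptions above."""
    shared, suffix = prefix_suffix(prev_term, term)
    suffix_bytes = suffix.encode("utf-8")
    return (len(encode_vint(shared))                # PrefixLength
            + len(encode_vint(len(suffix_bytes)))   # String length
            + len(suffix_bytes)                     # String bytes
            + len(encode_vint(field_num)))          # FieldNum

print(encode_vint(127).hex(), encode_vint(128).hex())
print(prefix_suffix("lucene", "lucent"))
print(term_entry_size("lucene", "lucent", 3))
```

Values up to 127 fit in one VInt byte, and a term that differs from its predecessor only in the last character needs a single suffix byte.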

1 of 2: name=_2cj docCount=708,639
    size (MB)=393,395.313
    diagnostics = {optimize=true, mergeFactor=9, os.version=2.6.18-238.1.1.el5, os=Linux, mergeDocStores=true, lucene.version=3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10, source=merge, os.arch=amd64, java.version=1.6.0_20, .vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_2cj_2.del]
    test: open reader.........OK [24 deleted docs]
    test: fields..............OK [55 fields]
    test: field norms.........OK [17 fields]
    test: terms, freq, prox...OK [2,723,440,775 terms; 35740903735 terms/docs pairs; 154861967859
    test: stored fields.......OK [11040443 total field count; avg 15.58 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per

[xxx@shotz-1 index]$ ls -l _2cj.tis
-rw-rw-r-- 1 tomcat dlps 24,775,378,328 Mar 12 17:16 _2cj.tis
