lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Term vectors: .tvf format question
Date Sun, 13 Jun 2004 03:46:34 GMT
Erik Hatcher wrote:
> I'm digging deeper into the Lucene index format to develop some higher 
> level diagrams of its structure.   One thing that is curious to me is 
> the term text being stored in the .tvf file.  Why not point to the term 
> dictionary by position somehow and avoid duplicating this string, saving 
> possibly substantial index size?  I'm assuming this is for performance 
> reasons.

The prefix compression helps some, but you're right, each term in a 
vector requires several bytes when it could optimally be represented as 
perhaps just one or two bytes on average if we numbered terms.
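
For readers unfamiliar with the prefix compression mentioned above, here is a minimal sketch of the <PrefixLength,Suffix> idea applied to a sorted term list. It is illustrative only, not Lucene's actual writer code: the class and method names are hypothetical, and it keeps entries in memory as objects rather than writing the byte-level .tvf encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of <PrefixLength,Suffix> term encoding.
// Not Lucene code; names and in-memory layout are assumptions.
class PrefixCompressionSketch {

    /** One encoded term: chars shared with the previous term, plus the rest. */
    static final class Entry {
        final int prefixLength;
        final String suffix;

        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes sorted terms, each relative to the term before it. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int max = Math.min(prev.length(), term.length());
            int shared = 0;
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, term.substring(shared)));
            prev = term;
        }
        return out;
    }

    /** Rebuilds the full terms by replaying the entries in order. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String term = prev.substring(0, e.prefixLength) + e.suffix;
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("index", "indexed", "indexing", "merge");
        for (Entry e : encode(terms)) {
            System.out.println(e.prefixLength + " " + e.suffix);
        }
    }
}
```

Even with the shared prefix removed, each term still stores its suffix characters, which is why a vector entry costs several bytes rather than the one or two a term number might need.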

The problem is maintaining the numbering as the index grows and changes.
Lucene indexes grow by merging segments.  With term numbers, each 
segment would have a separate term numbering system.  Terms would be 
renumbered as segments are merged.  This is not hard to implement.  When 
you merge the term dictionaries, keep an array per segment mapping its 
old term numbers to new term numbers in the merged index.  Then use 
these arrays to upgrade the vectors to the new numbering as they're 
copied into the new segment index.  So far so good.  It requires 4 bytes 
per term of RAM when merging.  That makes optimizing large indexes 
much more memory intensive than it is currently, but not prohibitively.
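
The renumbering step described above can be sketched roughly as follows. This is a hypothetical illustration, not code from Lucene: it assumes each segment's term dictionary is a sorted String array whose indexes serve as term numbers, and that a term vector is just an array of those numbers. All class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical sketch of term renumbering during a segment merge.
// Assumes term number == index into a segment's sorted term array.
class TermRenumberingSketch {

    /** Merges the per-segment dictionaries into one sorted dictionary. */
    static String[] mergeDictionaries(String[][] segmentTerms) {
        TreeSet<String> merged = new TreeSet<>();
        for (String[] terms : segmentTerms) {
            merged.addAll(Arrays.asList(terms));
        }
        return merged.toArray(new String[0]);
    }

    /** Builds, per segment, an int array mapping old term number to new. */
    static int[][] buildMaps(String[][] segmentTerms, String[] merged) {
        int[][] maps = new int[segmentTerms.length][];
        for (int s = 0; s < segmentTerms.length; s++) {
            maps[s] = new int[segmentTerms[s].length];
            for (int old = 0; old < segmentTerms[s].length; old++) {
                // Every segment term is present in the merged dictionary,
                // so the search always returns a non-negative index.
                maps[s][old] = Arrays.binarySearch(merged, segmentTerms[s][old]);
            }
        }
        return maps;
    }

    /** Rewrites one vector's term numbers while copying it to the new segment. */
    static int[] remapVector(int[] vector, int[] oldToNew) {
        int[] out = new int[vector.length];
        for (int i = 0; i < vector.length; i++) {
            out[i] = oldToNew[vector[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] segments = {
            {"apple", "lucene", "vector"},
            {"index", "lucene", "merge"}
        };
        String[] merged = mergeDictionaries(segments);
        int[][] maps = buildMaps(segments, merged);
        // A vector from segment 0 holding "lucene" and "vector".
        System.out.println(Arrays.toString(remapVector(new int[] {1, 2}, maps[0])));
    }
}
```

Because both the old and the merged dictionaries are sorted, each mapping is monotonic, so a vector whose term numbers were in ascending order stays in ascending order after remapping. The int arrays are the RAM cost in question: one 4-byte entry per term of each segment being merged.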

But what happens when you have an unoptimized index and you want to 
compare vectors from two different segments?  There's no way to do this 
without looking up all of the terms in each segment's term dictionary. 
This requires a random disk access per vector term and would hence be 
prohibitively slow.  MultiSearcher would have the same problem.

So term-number-based vectors would be small and fast to use if all 
you're using is a single, optimized index, but very slow to use with 
unoptimized indexes and multiple indexes.  That seems like a bad 
situation, so, unless someone figures out another way, we're stuck with 
the current approach.  Vectors are bigger and slower than optimal, but 
they're consistently so.

> Note, the Lucene index file formats documentation needs to be updated - 
> TermText is no longer just a String, it is a <PrefixLength,Suffix> 
> similar to how terms in the .tis are stored.  I've updated 
> fileformats.xml/.html - if I've gotten this wrong, let me know.

Looks good to me.  Thanks for catching this!

> Just out of curiosity - are there any other known inconsistencies with 
> the file formats documentation?

Good question.  Let me think...

The segments file has also changed format, and this is not yet reflected 
in the file format documentation.

The skip data description is new.  The text is clumsy, but I think it is 
mostly accurate.  One mistake is that TIFormat is now -2, not -1. 
Other than that, it looks right to me.

We should probably also somewhere make clear what's changed.  We promise 
to do so at the top of the file, but don't.  So perhaps sections which 
have changed should get "since 1.4" or "changed in 1.4" notices or 
somesuch.  This will make life much easier for ports.

Doug



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

