lucene-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: mg4j - Managing Gigabyte for Java
Date Thu, 16 Sep 2004 18:55:47 GMT
Antonio Gulli wrote:
> Just a question: my personal experience with a commercial engine I 
> partly developed is that the "continuation bit" (aka AltaVista solution) 
> is a good and efficient solution w.r.t. gamma code, delta code and other 
> codes used for variable-length int representation (see MG).
> 
> Given an int, say n, the continuation bit scheme treats each byte as 
> 7 bits of data + 1 bit that says whether the next byte is also used to 
> represent n.

This is what Lucene uses for the reasons you mention: it is a good 
compromise between compression and performance.
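
As an illustration, here is a minimal sketch of the continuation-bit scheme 
described above. It is not Lucene's actual source: the class and method 
names (VIntSketch, encode, decode) are made up for this example, and it 
assumes the low-order 7-bit group is written first, with the high bit of 
each byte set when another byte follows.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class, for illustration only.
final class VIntSketch {

    // Encode n as 1 to 5 bytes: 7 data bits per byte, low-order group first.
    // The high bit of a byte is the continuation bit (1 = more bytes follow).
    static byte[] encode(int n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7F) != 0) {          // more than 7 significant bits left
            out.write((n & 0x7F) | 0x80);   // emit 7 bits, set continuation bit
            n >>>= 7;                       // unsigned shift, so negatives terminate
        }
        out.write(n);                       // last byte, continuation bit clear
        return out.toByteArray();
    }

    // Decode one value starting at buf[offset].
    static int decode(byte[] buf, int offset) {
        byte b = buf[offset++];
        int n = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[offset++];
            n |= (b & 0x7F) << shift;       // splice in the next 7-bit group
        }
        return n;
    }
}
```

With this layout, values below 128 take a single byte, and 300 is written 
as the two bytes 0xAC 0x02. Decoding needs only a mask, a shift and a 
branch per byte, which is where the speed advantage over bit-aligned codes 
such as gamma and delta comes from.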

Long-term I'd like to make Lucene's posting format extensible.  In 
addition to altering the compression method, the granularity of the 
index should be flexible.  Currently postings for all indexed fields 
consist of <document, frequency, <position>* > tuples.  Instead, folks 
should be able to have postings like:
   . <document> for pure boolean matching only
   . <document, weight> for vector matching, no phrases
   . <document, frequency, <position, weight>* > for boosting term 
     occurrences by, e.g., position in document, bolding, headings, etc.
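
To make the three granularities concrete, here is a rough sketch of what 
each posting shape would carry. Everything in it is hypothetical: the class 
names, the fields and the totalWeight() helper are invented for this 
example and say nothing about how an extensible format would actually be 
laid out in Lucene.

```java
// Hypothetical in-memory shapes for the three posting granularities.
final class PostingSketch {

    // <document>: enough for pure boolean matching.
    static final class DocOnly {
        final int doc;
        DocOnly(int doc) { this.doc = doc; }
    }

    // <document, weight>: vector matching, no positions, so no phrases.
    static final class Weighted {
        final int doc;
        final float weight;
        Weighted(int doc, float weight) { this.doc = doc; this.weight = weight; }
    }

    // <document, frequency, <position, weight>*>: one weight per occurrence.
    static final class Positional {
        final int doc;
        final int[] positions;   // term positions within the document
        final float[] weights;   // weights[i] is the boost for positions[i]

        Positional(int doc, int[] positions, float[] weights) {
            if (positions.length != weights.length) {
                throw new IllegalArgumentException("one weight per position required");
            }
            this.doc = doc;
            this.positions = positions;
            this.weights = weights;
        }

        // Frequency is implied by the number of recorded occurrences.
        int frequency() { return positions.length; }

        // Illustrative only: sum of the per-occurrence weights.
        float totalWeight() {
            float sum = 0f;
            for (float w : weights) sum += w;
            return sum;
        }
    }
}
```

The point of the sketch is only that each shape stores strictly more per 
posting than the one before it, so an index that needs nothing beyond 
boolean matching would not have to pay for frequencies or positions.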

Extending Lucene to efficiently and flexibly support this will be a 
design challenge, but I think it will benefit lots of applications.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

