lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Flexible indexing
Date Wed, 14 Mar 2007 04:19:16 GMT

On Mar 13, 2007, at 2:03 AM, Nicolas Lalevée wrote:

>> At present KS allows you to attach both a Similarity and an Analyzer
>> to a field name via a FieldSpec subclass.  I haven't quite figured
>> out how to attach a posting format.  Should it return an object, like
>> FieldSpec's similarity() method does?  Should it actually implement a
>> codec?  Not sure yet.  What do you think?
>
> The posting format defines how you want to store the terms data, so
> defines how to search.

Hmm.  I'm talking about the stuff currently held in .frq, .prx,  
and .fNNN in Lucene.  That's not the terms data.  I think we're  
miscommunicating.

KinoSearch 0.20_01 and forward move the postings data  
from .frq, .prx, and .fNNN to a single file per field, with the  
extension .pNNN.  The philosophy of KS 0.20 is to have all binary  
"files" be decodable by launching a single iterator at the front of  
the file and having it read to the end.  (They're actually virtual  
files within the compound file -- KS only supports the compound  
format.)  That translates to one posting format per file.

> I don't think it is a good idea to mix different kinds of
> posting formats in the same index.

Allowing different fields to use different posting formats is very  
important.

When matching a value in a "category" field, all you might care about  
is whether the doc hits or not -- you don't care about freq, boost,  
per-position boost, any of that.  The posting format for "category"  
would thus specify "doc num only", and the .pNNN file would consist  
entirely of a sequence of delta-doc_num VInts.
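
A minimal sketch of such a "doc num only" stream, for illustration.  It
assumes a Lucene-style VInt (7 data bits per byte, low-order group first,
high bit set when more bytes follow) and strictly ascending doc numbers;
the function names are hypothetical and are not KinoSearch API.

```python
def encode_vint(n):
    """Encode a non-negative int as a Lucene-style VInt (assumed layout)."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)                      # final byte, continuation bit clear
    return bytes(out)


def encode_doc_nums(doc_nums):
    """Write ascending doc numbers as a sequence of delta VInts."""
    out = bytearray()
    prev = 0
    for doc_num in doc_nums:
        out += encode_vint(doc_num - prev)  # store the gap, not the value
        prev = doc_num
    return bytes(out)


def decode_doc_nums(data):
    """Read the stream front to back, rebuilding absolute doc numbers."""
    doc_nums = []
    doc_num = delta = shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                 # more bytes belong to this VInt
        else:
            doc_num += delta
            doc_nums.append(doc_num)
            delta = shift = 0
    return doc_nums


encoded = encode_doc_nums([3, 7, 300])
print(encoded.hex(), decode_doc_nums(encoded))
```

Small gaps between doc numbers fit in one byte each, which is why this
format is cheap for a field that only needs hit/no-hit answers.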

In contrast, a "content" field scoring HTML source material might  
specify a posting format that includes boost-per-position.  Each  
record would have one doc_num, one freq, several positions, and  
several boosts.  The file would be much more complex.
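
For contrast, here is a rough sketch of a richer record.  The layout is an
assumption made up for illustration (delta doc_num, freq, delta-encoded
positions, then one boost byte per position); the message does not specify
the actual on-disk layout, and the names below are hypothetical.

```python
import io


def write_vint(out, n):
    """Append a Lucene-style VInt (assumed layout) to a bytearray."""
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)


def read_vint(stream):
    """Read one VInt from a binary stream."""
    result = shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("truncated VInt")
        result |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            return result
        shift += 7


def encode_rich_postings(postings):
    """postings: list of (doc_num, positions, boost_bytes), doc_nums ascending."""
    out = bytearray()
    prev_doc = 0
    for doc_num, positions, boosts in postings:
        write_vint(out, doc_num - prev_doc)
        prev_doc = doc_num
        write_vint(out, len(positions))      # freq
        prev_pos = 0
        for pos in positions:
            write_vint(out, pos - prev_pos)  # positions are delta-encoded too
            prev_pos = pos
        out.extend(boosts)                   # one byte (0-255) per position
    return bytes(out)


def iter_rich_postings(data):
    """A single iterator that starts at the front and reads to the end."""
    stream = io.BytesIO(data)
    doc_num = 0
    while stream.tell() < len(data):
        doc_num += read_vint(stream)
        freq = read_vint(stream)
        positions, pos = [], 0
        for _ in range(freq):
            pos += read_vint(stream)
            positions.append(pos)
        boosts = list(stream.read(freq))
        yield doc_num, positions, boosts


records = [(2, [0, 5], [255, 10]), (9, [3], [1])]
data = encode_rich_postings(records)
print(len(data), list(iter_rich_postings(data)))
```

Even in this toy layout each record carries several extra values per
document, all of which must be decoded or skipped at search time.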

If you want to score based on "content", but constrain results based  
on "category", you want to allow the simpler format for the  
"category" field, or you'll be wasting both disk and CPU.

It's actually possible to make multiple posting formats
work within a single monolithic postings file, but I opted to avoid  
that for the sake of simplicity and ease of debugging.

> It will make it Lucene's responsibility to manage different kinds of
> readers instantiating different kinds of termEnums, and so on.

I've actually chosen to break up the term list into two separate  
files per field as well.  This was a more costly and dubious choice,  
but was harmonious with KinoSearch's expansion of field semantics.

KS will soon allow users to determine sort order of term texts within  
each field.  Keeping separate TermLists for each field means that I  
don't need to worry about either tracking field numbers/names or
switching up comparators -- the TermList iterator terminates rather
than proceeding on to another field like TermEnum does.
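
A small sketch of the idea, under the assumption that each field owns its
own term list and sort key.  The TermList class here is a hypothetical
in-memory stand-in, not the KinoSearch class of the same name.

```python
class TermList:
    """Terms for exactly one field, kept in that field's own sort order."""

    def __init__(self, terms, sort_key=None):
        # The comparator is fixed when the list is built, so iteration
        # never has to switch comparators or check a field number.
        self._terms = sorted(set(terms), key=sort_key)

    def __iter__(self):
        # Iteration simply ends when this field's terms run out.
        return iter(self._terms)


# One list per field, each with its own ordering.
term_lists = {
    "title": TermList(["banana", "Apple", "cherry"], sort_key=str.lower),
    "sku": TermList(["10", "9", "100"], sort_key=int),
}

for field, term_list in term_lists.items():
    print(field, list(term_list))
```

A single merged enumeration would instead have to carry the field along
with every term and pick the right comparator at each field boundary.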

> I don't really know what the different kinds of impact of such a
> feature will be, but it might be quite difficult to manage it
> correctly. But as the posting format can be redefined by the user, he
> can implement a custom format which internally handles different kinds
> of data associated with terms.

If you guarantee that the posting format for a given field can never  
change by imposing global field semantics, it's not a big deal.  If  
you break things up by field at both the file and the data structure  
level, it gets even easier.
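
One way to picture "global field semantics" is a registry that fixes a
field's posting format the first time the field is defined and refuses any
later change.  This is only a sketch; the Schema class, its method names,
and the format labels are hypothetical.

```python
class Schema:
    """Index-wide registry mapping each field name to one posting format."""

    def __init__(self):
        self._formats = {}

    def spec_field(self, name, posting_format):
        # First definition wins; redefining with the same format is a no-op.
        existing = self._formats.setdefault(name, posting_format)
        if existing != posting_format:
            raise ValueError(
                f"field {name!r} already uses posting format {existing!r}"
            )

    def posting_format(self, name):
        return self._formats[name]


schema = Schema()
schema.spec_field("category", "doc_only")
schema.spec_field("content", "positions_and_boosts")

try:
    schema.spec_field("category", "positions_and_boosts")
except ValueError as err:
    print("rejected:", err)
```

With that guarantee, a reader can pick the decoder for a field once, from
the field name alone, and never has to check the format per segment.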

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
