lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Flexible indexing
Date Mon, 12 Mar 2007 20:34:55 GMT

On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:

> - Introduce index format. Nicolas has already written a lot of code  
> in this regard!

I worry that going the interface route is going to be too  
restrictive.  When I looked at Nicolas's index format spec, I  
immediately wanted to add an Analyzer and a bunch of other stuff to  
it.  Other people are going to want to add their own stuff.

My suggestion is that the top-level plan for the index be called  
Schema, and that it be an abstract class.  An email to the KS list  
explaining the rationale behind KinoSearch's current version of this  
is below my sig.  Here are the API docs:

   http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema.html
   http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema/FieldSpec.html

It uses global field semantics, which Hoss won't be happy about.  ;)   
However, I'm grateful to Hoss for past critiques, as they've helped  
me to refine and improve how Schema works.  For instance, as of KS  
0.20_02 you can introduce new field_name => FieldSpec associations to  
KS at any time during indexing.
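
To make the idea of global field semantics concrete, here is a minimal sketch of a Schema that owns field_name => FieldSpec associations and accepts new ones at any time, but never lets an existing one change.  It is written in Python only for brevity; the class layout, the add_field() name and the conflict rule are assumptions for illustration, not the KinoSearch API.

```python
class FieldSpec:
    """Hypothetical per-field spec: flags plus pluggable behavior."""
    indexed = True
    stored = True


class UnindexedField(FieldSpec):
    """Hypothetical variant: stored for retrieval but not searchable."""
    indexed = False


class Schema:
    """Hypothetical top-level index plan; meant to be subclassed."""

    # Subclasses list the fields known up front.
    initial_fields = {}

    def __init__(self):
        self.fields = dict(self.initial_fields)

    def add_field(self, name, spec):
        # A field name may gain a spec at any time during indexing...
        existing = self.fields.get(name)
        if existing is None:
            self.fields[name] = spec
        # ...but once bound, its kind of spec is fixed for the whole index.
        elif type(existing) is not type(spec):
            raise ValueError(
                f"field {name!r} is already bound to {type(existing).__name__}"
            )


class BlogSchema(Schema):
    initial_fields = {"title": FieldSpec()}
```

Under these assumptions, BlogSchema().add_field("content", FieldSpec()) succeeds, while a later add_field("title", UnindexedField()) raises.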

It may be that adapting Lucene to use something like what KS uses  
would be too radical a change.  However, I believe that one reason  
that flexible indexing has been in incubation so long is that the  
current mechanism for attaching semantics to field names does not  
scale as well as it might.

For instance, the logical extension of the current FieldInfos system  
is to add booleans as described at
<http://wiki.apache.org/lucene-java/FlexibleIndexing>.  However,
conflict resolution during segment  
merging is going to present challenges.  What happens when in one  
segment 'content' has freq and in another segment it doesn't?  Things  
are so much easier if the posting format, once set, never changes.
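
The merge problem can be shown with a small sketch.  It assumes each segment carries its own map of per-field booleans, and it picks one possible policy, refusing to merge on disagreement; this is not how Lucene's FieldInfos behaves, and every name here (FieldFlags, store_freq, merge_field_flags) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldFlags:
    """Hypothetical per-segment booleans for one field."""
    indexed: bool
    store_freq: bool


def merge_field_flags(segments):
    """Merge per-segment {field_name: FieldFlags} maps, refusing conflicts."""
    merged = {}
    for seg_no, seg in enumerate(segments):
        for name, flags in seg.items():
            # The first segment to mention a field decides its flags.
            prior = merged.setdefault(name, flags)
            if prior != flags:
                raise ValueError(
                    f"segment {seg_no}: field {name!r} has {flags}, "
                    f"but earlier segments have {prior}"
                )
    return merged
```

Any other policy (drop freq everywhere, or synthesize a freq of 1) has to rewrite postings during the merge, which is the cost a fixed posting format avoids.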

> It will include different interfaces for the different extension  
> points (FieldsFormat, PostingFormat, DictionaryFormat).

KS still uses TermDocs and its children, but I'm about to go in and  
replace them with PostingList.  What subclass of Posting the  
PostingList returns would be controlled by the FieldSpec.

At present KS allows you to attach both a Similarity and an Analyzer  
to a field name via a FieldSpec subclass.  I haven't quite figured  
out how to attach a posting format.  Should it return an object, like  
FieldSpec's similarity() method does?  Should it actually implement a  
codec?  Not sure yet.  What do you think?
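
For discussion, the "return an object" option might look roughly like the sketch below, where the FieldSpec hands back a Posting subclass and the PostingList instantiates whatever it is given.  Everything here is an assumption made for illustration: the posting() method name, the MatchPosting and ScorePosting classes, and the tuple "encoding" are not existing KS or Lucene APIs.

```python
class Posting:
    """Hypothetical base: what is recorded for one term in one document."""

    def __init__(self, doc_id, positions):
        self.doc_id = doc_id
        self.positions = positions

    def encode(self):
        raise NotImplementedError


class MatchPosting(Posting):
    """Records only that the document matched."""

    def encode(self):
        return (self.doc_id,)


class ScorePosting(Posting):
    """Records the document plus the within-document frequency."""

    def encode(self):
        return (self.doc_id, len(self.positions))


class FieldSpec:
    def posting(self):
        # Assumed hook, by analogy with similarity(): the default format.
        return ScorePosting


class CategoryField(FieldSpec):
    def posting(self):
        # A field that never needs freq can opt for the cheaper format.
        return MatchPosting


class PostingList:
    """Iterates raw (doc_id, positions) pairs as the spec's Posting type."""

    def __init__(self, spec, raw):
        self._posting_class = spec.posting()
        self._raw = raw

    def __iter__(self):
        for doc_id, positions in self._raw:
            yield self._posting_class(doc_id, positions)
```

The codec option would instead put the read/write logic on the object the FieldSpec returns; the sketch leaves that choice open.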

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

--------------------------------------------------------------------

Begin forwarded message:
From: Marvin Humphrey <marvin@rectangular.com>
Date: February 27, 2007 1:08:33 AM PST
To: KinoSearch discussion forum <kinosearch@rectangular.com>
Subject: [KinoSearch] KinoSearch::Schema - Rationale
Reply-To: KinoSearch discussion forum <kinosearch@rectangular.com>

Greets,

The thing about Lucene/KS indexes is that all the information you  
need to read them can never be stored in the index files alone  
because there's always that bleedin' Analyzer.  You can look at a  
Lucene index and see that it has fields with certain names that are  
indexed, stored, etc, but you can't actually make sense of the  
index's content unless you know everything about all Analyzers used  
at index-time.

Since the Analyzer is not hooked to the index file, but has to be  
created anew in every app that interacts with the index, it's often  
wrong, and analyzer mismatches are a constant source of confusion,  
frustration, and error for users.
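
A toy illustration of such a mismatch, using a made-up in-memory inverted index rather than any Lucene or KS API: the indexing app lowercases its tokens, the search app forgets to, and a query that should match finds nothing.

```python
def lowercasing_analyzer(text):
    """Index-time choice: split on whitespace and lowercase."""
    return text.lower().split()


def whitespace_analyzer(text):
    """A different analyzer, wired in by mistake at search time."""
    return text.split()


def build_index(docs, analyze):
    """Map each token to the set of doc ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, analyze):
    """Return the doc ids containing every token of the analyzed query."""
    hits = None
    for token in analyze(query):
        docs = index.get(token, set())
        hits = docs if hits is None else hits & docs
    return hits or set()


docs = {1: "Flexible Indexing"}
index = build_index(docs, lowercasing_analyzer)
matched = search(index, "Flexible", lowercasing_analyzer)  # same analyzer
missed = search(index, "Flexible", whitespace_analyzer)    # mismatch
```

Nothing in the index itself records which analyzer produced its tokens, so the second search fails silently.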

KinoSearch::Schema solves the Analyzer problem.  Not only that, but  
it sets the stage for attaching ever more semantic meaning to field  
names.  Not just booleans like "I'm indexed" and "I'm stored", but  
behaviors, objects...  For example, each field may now be associated  
with its own Similarity implementation, which affects scoring.  In  
the reasonably near future, the plan is to allow each FieldSpec to  
define a comparison sub which determines the sort order of terms.   
And so on.

Schema is somewhat akin to SWISH's index configuration file, which  
can hold regexes, stoplists, and so on.  In fact, an earlier  
incarnation of Schema was primarily concerned with reading/writing a  
configuration file.  It attempted to solve the Lucene Analyzer  
problem by storing EVERYTHING, including a class name for the  
Analyzer; at search-time, the Analyzer object was created by calling  
a no-arg constructor.
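
Roughly, that earlier design amounted to something like the following sketch.  The JSON layout, the analyzer_class key and the registry lookup are stand-ins invented for illustration; the original stored a class name in its own configuration format.

```python
import json


class StopAnalyzer:
    def __init__(self):
        # No arguments allowed: everything must be baked into the class.
        self.stoplist = {"the", "a", "an"}

    def analyze(self, text):
        return [t for t in text.lower().split() if t not in self.stoplist]


# Hypothetical lookup table from stored class name to class.
ANALYZER_CLASSES = {"StopAnalyzer": StopAnalyzer}


def load_analyzer(config_text):
    """Recreate the index-time Analyzer from the stored class name."""
    config = json.loads(config_text)
    analyzer_class = ANALYZER_CLASSES[config["analyzer_class"]]
    return analyzer_class()  # the no-arg constructor call


analyzer = load_analyzer('{"analyzer_class": "StopAnalyzer"}')
```

The restriction is visible in the last line of load_analyzer(): a custom stoplist or language setting cannot be passed in, so each variation needs its own class.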

I ash-canned that design after trying to write docs explaining the  
bit about the no-arg constructor -- too confusing, not Perlish, and  
ultimately, less direct than allowing the user to write arbitrary  
code.  It's hard to maintain security, though, when you allow data  
files to contain code.  (I'm sure SWISH manages it, I just don't want  
the same headache).

The thinking behind KinoSearch::Schema is, if you're going to create  
an index configuration file that has code in it, why not go all the  
way, and make it a Perl module?  It's the best of all worlds.  You  
get to leverage the power of the language itself when defining your  
index structure, but it's also a self-contained, complete spec that  
both your indexing app and your search app can load.
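
The same idea in sketch form, in Python rather than Perl and with invented class names (BlogSchema, Indexer, Searcher): the schema is ordinary code, and because both sides take their analysis from it, they cannot drift apart.

```python
class Schema:
    """Hypothetical base class: the complete, loadable spec for an index."""

    def analyze(self, field, text):
        raise NotImplementedError


class BlogSchema(Schema):
    """Would live in its own module, imported by both apps."""

    def analyze(self, field, text):
        # Arbitrary code is allowed here; this one just lowercases.
        return text.lower().split()


class Indexer:
    def __init__(self, schema):
        self.schema = schema
        self.postings = {}

    def add_doc(self, doc_id, fields):
        for field, text in fields.items():
            for token in self.schema.analyze(field, text):
                self.postings.setdefault((field, token), set()).add(doc_id)


class Searcher:
    def __init__(self, schema, postings):
        self.schema = schema
        self.postings = postings

    def search(self, field, query):
        # Query text goes through the same schema-defined analysis.
        result = None
        for token in self.schema.analyze(field, query):
            docs = self.postings.get((field, token), set())
            result = docs if result is None else result & docs
        return result or set()


indexer = Indexer(BlogSchema())
indexer.add_doc(7, {"title": "Flexible Indexing"})
searcher = Searcher(BlogSchema(), indexer.postings)
hits = searcher.search("title", "FLEXIBLE indexing")
```

Here the search app builds its own BlogSchema instance, as a separate process would, and still analyzes the query exactly as the indexer analyzed the document.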

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch






---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

