lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Global field semantics
Date Sun, 09 Jul 2006 06:13:13 GMT

On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:

> Many things would be cleaner in Lucene if fields had a global semantics,
> i.e., if properties like text vs. binary, Index, Store, TermVector, the
> appropriate Analyzer, the assignment of Directory in ParallelReader (or
> ParallelWriter), etc. were a function of just the field name and the
> index.

This is the direction I would like to go.

> This approach would naturally admit a class, say IndexFieldSet,
> that would hold global field semantics for an index.
> Lucene today allows many field properties to vary at the Field level.
> E.g., the same field name might be tokenized in one Field on a Document
> while it is untokenized in another Field on the same or different
> Document.  Does anybody know how often this flexibility is used?  Are
> there interesting use cases for which it is important?  It seems to me
> this functionality is already problematic and not fully supported; e.g.,
> indexing can manage tokenization-variant fields, but query parsing
> cannot.  Various extensions to Lucene exacerbate this kind of problem.
> Perhaps more controversially, the notion of global field semantics would
> be even stronger if the set of fields is closed.  This would allow, for
> example, QueryParser to validate field names.  This has a number of
> benefits, including for example avoiding false-negative "no results" due
> to misspelling a field name.
> Has this been considered before?

Robert Kirchgessner made some of the same arguments in a January  
thread.  They were compelling then, and they're compelling now.
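
To make the quoted proposal a little more concrete, here is a rough sketch of what a per-index field registry with an optional closed field set might look like. It is written in Python only to keep it short. Every name in it (FieldSpec, IndexFieldSet, define, lookup, validate_query_fields) is hypothetical; none of this is existing Lucene API, and the property list is just the subset Chuck mentioned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical per-field properties, fixed for the whole index."""
    tokenized: bool = True
    stored: bool = True
    term_vector: bool = False
    analyzer: str = "standard"  # placeholder label, not a real class name


class IndexFieldSet:
    """Hypothetical registry mapping field names to global semantics."""

    def __init__(self, closed=False):
        self._specs = {}
        self.closed = closed

    def define(self, name, spec):
        # The same field name may not carry two different sets of
        # properties; that is the whole point of global semantics.
        existing = self._specs.get(name)
        if existing is not None and existing != spec:
            raise ValueError(f"conflicting definition for field {name!r}")
        self._specs[name] = spec

    def lookup(self, name):
        if name in self._specs:
            return self._specs[name]
        if self.closed:
            # Closed set: an unknown name is an error, not an empty result.
            raise KeyError(f"unknown field {name!r}")
        # Open set (an assumption of this sketch): fall back to defaults.
        return FieldSpec()


def validate_query_fields(field_set, field_names):
    """Return the names a query parser would reject under a closed set."""
    rejected = []
    for name in field_names:
        try:
            field_set.lookup(name)
        except KeyError:
            rejected.append(name)
    return rejected
```

With a closed set that defines only "title" and "body", validate_query_fields(fields, ["title", "titel"]) reports "titel" instead of letting the query silently match nothing.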

In June, Dave Balmain and I discussed the issue extensively on the  
Ferret list.  It might have been nice to use the Lucy list, since a  
lot of the discussion was about Lucy, but the Lucy lists didn't exist  
at the time.

Thoughts on document storage that occurred to me after that
discussion: maybe the fdx file should specify two numbers: a file
pointer, and an integer which indicates the class of object stored at
that position in the fdt file.  The registry which maps integers to
classes could be stored in some centralized file.  Perhaps one of
these classes -- a LazyDoc -- could specify that only a few integer
file pointers should be read right away, deferring reading of field
data until later.
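
A minimal sketch of that idea, again in Python for brevity. The layout is entirely assumed: a fixed-width fdx entry holding an 8-byte pointer and a 4-byte class id, and an fdt record that is just a length-prefixed blob. The registry is an in-memory dict rather than a centralized file, EagerDoc is a made-up counterpart to LazyDoc, and this LazyDoc holds a single pointer rather than several. It is not the real Lucene file format.

```python
import io
import struct

# Assumed fdx entry: 8-byte fdt file pointer + 4-byte class id, big-endian.
FDX_ENTRY = struct.Struct(">qi")
# Assumed fdt record: 4-byte length followed by that many bytes of data.
FDT_LENGTH = struct.Struct(">i")


def _read_record(fdt, pointer):
    fdt.seek(pointer)
    (length,) = FDT_LENGTH.unpack(fdt.read(FDT_LENGTH.size))
    return fdt.read(length)


class EagerDoc:
    """Hypothetical class: reads its field data as soon as it is built."""

    def __init__(self, fdt, pointer):
        self.data = _read_record(fdt, pointer)


class LazyDoc:
    """Keeps only the pointer; field data is read on first access."""

    def __init__(self, fdt, pointer):
        self._fdt = fdt
        self._pointer = pointer
        self._data = None

    @property
    def data(self):
        if self._data is None:
            self._data = _read_record(self._fdt, self._pointer)
        return self._data


# Stand-in for the centralized registry mapping integers to classes.
CLASS_REGISTRY = {0: EagerDoc, 1: LazyDoc}


def write_doc(fdx, fdt, class_id, payload):
    """Append one document: data to fdt, (pointer, class id) to fdx."""
    pointer = fdt.seek(0, io.SEEK_END)
    fdt.write(FDT_LENGTH.pack(len(payload)))
    fdt.write(payload)
    fdx.seek(0, io.SEEK_END)
    fdx.write(FDX_ENTRY.pack(pointer, class_id))


def read_doc(fdx, fdt, doc_num):
    """Look up the fdx entry and build whichever class it names."""
    fdx.seek(doc_num * FDX_ENTRY.size)
    pointer, class_id = FDX_ENTRY.unpack(fdx.read(FDX_ENTRY.size))
    return CLASS_REGISTRY[class_id](fdt, pointer)
```

Because the fdx entries are fixed-width, finding document N is a single seek, and the class id decides whether the fdt data gets touched at all at that point.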

> Are there good reasons this path has not been followed?

Hoss, that's your cue.

Marvin Humphrey
Rectangular Research
