lucene-dev mailing list archives

From "David Balmain" <>
Subject Re: Global field semantics
Date Mon, 10 Jul 2006 08:04:36 GMT
On 7/10/06, Doug Cutting <> wrote:
> Chuck Williams wrote:
> > Lucene today allows many field properties to vary at the Field level.
> > E.g., the same field name might be tokenized in one Field on a Document
> > while it is untokenized in another Field on the same or different
> > Document.
> The rationale for this design was to keep the API simple.  I think of it
> like variable declarations: some languages require them and some don't.
>   I opted to make Lucene fields like dynamically-typed variables.  In
> part, Lucene's popularity is due to the simplicity of its API.

The irony has just struck me that most people are happy with the
"dynamically-typed" fields in Java (Lucene), but they didn't go down as
well in Ruby (Ferret).

> However, in my uses of Lucene, most documents have the same fields used
> in the same way, so I don't think I've ever actually taken much
> advantage of this functionality.  It is nice to be able to add a field
> to an index by changing the indexing code in a single place, where the
> field's value is created, and not having to also change the index
> initialization code.  We should try to keep such redundancies out of
> user code.
> Thus I would encourage any change in this direction to continue to
> permit fields to be defined lazily, the first time they are added,
> rather than requiring all fields to be declared up front.  Are there
> substantial optimizations that are only possible if all fields are known
> when the index is initialized?

I don't think declaring all fields up front is necessary for
substantial optimizations. I've found that the key to some really good
optimizations is having constant field numbers. That is, once a field
is added to the index, it is assigned a field number and it keeps
that field number for the life of the index. This allows one
FieldInfos object per index instead of one per segment. As I mentioned
earlier, this greatly optimizes the merging of term vectors and stored
fields. The only problem I could find with this solution is that
fields are no longer in alphabetical order in the term dictionary. I
couldn't think of a use case where that ordering is necessary,
although I'm sure there probably is one.
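
To make the idea concrete, here is a minimal Java sketch of field
numbers that are assigned lazily, the first time a field is seen, and
then never change. The class and method names (GlobalFieldNumbers,
numberFor, nameFor) are made up for illustration; this is not Lucene's
FieldInfos class or Ferret's actual implementation, just one way the
bookkeeping could look.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical index-wide registry of field numbers (illustration only).
class GlobalFieldNumbers {
    private final Map<String, Integer> numberByName = new HashMap<>();
    private final List<String> nameByNumber = new ArrayList<>();

    // Returns the field's existing number, or assigns the next free one
    // the first time the field name is seen. Numbers are never reused.
    synchronized int numberFor(String fieldName) {
        Integer number = numberByName.get(fieldName);
        if (number == null) {
            number = nameByNumber.size();
            numberByName.put(fieldName, number);
            nameByNumber.add(fieldName);
        }
        return number;
    }

    // Reverse lookup: the field name that owns a given number.
    synchronized String nameFor(int number) {
        return nameByNumber.get(number);
    }

    // How many distinct fields the index has seen so far.
    synchronized int size() {
        return nameByNumber.size();
    }
}
```

If "title" is added first and "body" second, they get numbers 0 and 1
and keep them no matter which segment they later appear in. The
numbers follow first-seen order rather than alphabetical order, which
is the trade-off described above.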

Anyway, hopefully we'll be able to lead the way with some brilliant
new ideas in the Lucy project. Put our money where our mouth is, so to
speak. If only I had a little more time right now.

