lucene-dev mailing list archives

From "David Balmain" <dbalmain...@gmail.com>
Subject Re: Global field semantics
Date Mon, 10 Jul 2006 04:48:10 GMT
On 7/10/06, Chuck Williams <chuck@manawiz.com> wrote:
> David Balmain wrote on 07/09/2006 06:44 PM:
> > On 7/10/06, Chuck Williams <chuck@manawiz.com> wrote:
> >> Marvin Humphrey wrote on 07/08/2006 11:13 PM:
> >> >
> >> > On Jul 8, 2006, at 9:46 AM, Chuck Williams wrote:
> >> >
> >> >> Many things would be cleaner in Lucene if fields had a global semantics,
> >> >> i.e., if properties like text vs. binary, Index, Store, TermVector, the
> >> >> appropriate Analyzer, the assignment of Directory in ParallelReader (or
> >> >> ParallelWriter), etc. were a function of just the field name and the
> >> >> index.
> >> >
> >> > In June, Dave Balmain and I discussed the issue extensively on the
> >> > Ferret list.  It might have been nice to use the Lucy list, since a
> >> > lot of the discussion was about Lucy, but the Lucy lists didn't exist
> >> > at the time.
> >> >
> >> > http://rubyforge.org/pipermail/ferret-talk/2006-June/000536.html
> >> >
> >> I think there are a number of problems with that proposal and hope it
> >> was not adopted.
> >
> > Hi Chuck,
> >
> > Actually, it was adopted and I'm quite happy with the solution. I'd be
> > very interested to hear what those problems are, besides the
> > example you've already given. Even if you never use Ferret, it can
> > only help me improve my software.
>
> Hi David,
>
> Thanks for your reply.
>
> I'm not aware of other problems beyond the ones I've already cited.
> After thinking of these, my confidence that there were not others waned.
>
> >
> > I'll start by covering your term-vector example. By adding fixed
> > index-wide field properties to Ferret I was able to obtain a
> > huge speed improvement during indexing.
>
> This is very interesting.  Can you say how much?

About a factor of 5. I won't compare it to Lucene's speed, though,
as I know that's asking for trouble. You'll be able to try it yourself
in a week or so when I finally release it.

> > With the CPU time I gain in Ferret I could
> > easily re-analyze large fields and build term vectors for them
> > separately. It's a little more work for less common use cases like
> > yours but in the end, everyone benefits in terms of performance.
>
> Does Ferret work this way, or would that be up to the application?

Currently that would be up to the application.

> >> As my earlier example showed, there is at least one
> >> valid use case where storing a term vector is not an invariant property
> >> of a field; specifically, when using term vectors to optimize excerpt
> >> generation, it is best to store them only for fields that have long
> >> values.  This is even a counter-example to Karl's proposal, since a
> >> single Document may have multiple fields of the same name, some with
> >> long values and others with short values; multiple fields of the same
> >> name may legitimately have different TermVector settings even on a
> >> single Document.
> >
> > I think you'll find if you look at the DocumentWriter#writePostings
> > method that it's "one in, all in" in terms of storing term vectors for
> > a field. That is, if you have 5 "content" fields and only one of those
> > is set to store term vectors, then all of the fields will store term
> > vectors.
>
> Right you are, and clearly necessarily so since the values of the
> multiple fields are implicitly concatenated (with
> positionIncrementGap).  So, Lucene already limits my term vector
> optimization to the Document level.  As it happens, I only use it for
> large body fields, of which each of my Documents has at most one.
>
> >
> >> I haven't thought of cases where Index or Store would legitimately vary
> >> across Fields or Documents, but am less convinced there aren't important
> >> use cases for these as well.  Similarly, although it is important to
> >> allow term vectors to be on or off at the field level, I don't see any
> >> obvious need to vary the type of term vector (positions, offsets or
> >> both).
> >
> > I think Store could definitely legitimately vary across Fields or
> > Documents for the same reason your term vectors do. Perhaps you are
> > indexing pages from the web and you want to cache only the smaller
> > pages.
>
> That's an interesting example, but not as compelling an objection to me
> (and seemingly not to you either!).  The app could always store an empty
> string without much consequence in this scenario.
>
> >
> >> There are significant benefits to global semantics, as evidenced by the
> >> fact that several of us independently came to desire this.  However,
> >> deciding what can be global and what cannot is more subtle.
> >
> > I agree. I can't see global field semantics making it into Lucene in
> > the short term. It's a rather large change, particularly if you want
> > to make full use of the performance benefits it affords.
>
> Could you summarize where these derive from?

I'm afraid I don't have time to go into detail. The main benefit comes
from having constant field numbers for each field. So when segments
merge I don't need to read in documents and term vectors and then
rewrite them to the new segment. I can just copy the data directly
from the old segment to the new segment. As far as TermInfos go, the
techniques I use in Ferret probably wouldn't translate well into Java.
But the merge model we'll be using for Lucy is Marvin Humphrey's
KinoSearch merge model, which you can read about here:

    http://wiki.apache.org/jakarta-lucene/KinoSearchMergeModel

I think this would work well in Lucene. His results with KinoSearch
are very impressive.
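
To illustrate the point about constant field numbers, here is a minimal
sketch. It is not Ferret, Lucene, or KinoSearch code: the class
MergeSketch and both methods are hypothetical, and a stored document is
modelled as nothing more than an array of field numbers. With
per-segment numbering every record has to be decoded and rewritten at
merge time; with index-wide numbering the records can be appended
untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {
    // Per-segment numbering: field number i in this segment means
    // segmentFieldNames[i], so each number must be translated into the
    // numbering of the target segment. Every document is rewritten.
    // (A name missing from targetNumbers would fail here; the sketch
    // assumes the target map covers every field in the segment.)
    static List<int[]> mergeWithRemap(List<int[]> docs,
                                      String[] segmentFieldNames,
                                      Map<String, Integer> targetNumbers) {
        List<int[]> out = new ArrayList<>();
        for (int[] doc : docs) {
            int[] rewritten = new int[doc.length];
            for (int i = 0; i < doc.length; i++) {
                rewritten[i] = targetNumbers.get(segmentFieldNames[doc[i]]);
            }
            out.add(rewritten);
        }
        return out;
    }

    // Index-wide numbering: a field has the same number in every
    // segment, so the stored records are carried over as they are.
    static List<int[]> mergeByCopy(List<int[]> docs) {
        return new ArrayList<>(docs);
    }

    public static void main(String[] args) {
        List<int[]> docs = Arrays.asList(new int[] {0, 1}, new int[] {1});
        Map<String, Integer> target = new HashMap<>();
        target.put("title", 3);
        target.put("body", 7);
        String[] names = {"title", "body"};
        for (int[] doc : mergeWithRemap(docs, names, target)) {
            System.out.println(Arrays.toString(doc));
        }
    }
}
```

In a real index the records would be byte ranges in a file rather than
int arrays, and the copy would be a bulk byte copy; the sketch only
shows which of the two paths has to touch each record.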

> >
> >> Perhaps the best thing at the Lucene level is to have a notion of
> >> default semantics for a field name.  Whenever a Field of that name is
> >> constructed, those semantics would be used unless the constructor
> >> overrides them.  This would allow additional constructors on Field with
> >> simpler signatures for the common case of invariant Field properties.
> >> It would also allow applications to access the class that holds the
> >> default field information for an index.  The application will know which
> >> properties it can rely on as invariant and whether or not the set of
> >> fields is closed.
> >>
> >> This approach would preserve upward compatibility and provide, I
> >> believe, most of the benefits we all seek.
> >>
> >> Thoughts?
> >
> > If this is all you are going to add, I don't think you'd need to
> > change Lucene. You could just implement a DocumentFactory in your own
> > application. Perhaps something like this could go in the contrib
> > section of Lucene.
>
> I've already done it in my application (this weekend).  I think Lucene
> would be better with a mechanism like this built-in as field semantics
> are usually globally invariant.  I'm left wondering whether many of the
> performance optimizations you've realized might be preserved in a model
> that allowed selected exceptions, such as the term vector example.

Sure they would. As I already mentioned, most of my performance
benefits come from having constant field numbers for fields. I could
easily implement the model you've described in Ferret without a
performance hit, but I'm going to wait and see if "exceptional fields"
is a requested feature before I do.
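
A minimal sketch of the "global field spec with exceptions" model
discussed above, assuming nothing about how Ferret or Lucene would
actually implement it. FieldDefaultsSketch, FieldSpec, and all method
names are hypothetical. Each field name is defined once and keeps its
number for the life of the index; a caller may still override the
default properties for an individual field instance, for example to
store a term vector only for a long body.

```java
import java.util.HashMap;
import java.util.Map;

class FieldDefaultsSketch {
    // Hypothetical bundle of per-field properties.
    static final class FieldSpec {
        final boolean stored;
        final boolean indexed;
        final boolean termVector;

        FieldSpec(boolean stored, boolean indexed, boolean termVector) {
            this.stored = stored;
            this.indexed = indexed;
            this.termVector = termVector;
        }

        // Returns a copy that differs only in the term vector setting.
        FieldSpec withTermVector(boolean tv) {
            return new FieldSpec(stored, indexed, tv);
        }
    }

    private final Map<String, FieldSpec> defaults = new HashMap<>();
    private final Map<String, Integer> numbers = new HashMap<>();

    // Defines a field once and returns its constant field number.
    int define(String name, FieldSpec spec) {
        if (numbers.containsKey(name)) {
            throw new IllegalArgumentException("already defined: " + name);
        }
        int number = numbers.size();
        numbers.put(name, number);
        defaults.put(name, spec);
        return number;
    }

    int numberOf(String name) {
        Integer number = numbers.get(name);
        if (number == null) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return number;
    }

    // Default semantics for a field name.
    FieldSpec specFor(String name) {
        numberOf(name); // rejects unknown names
        return defaults.get(name);
    }

    // Default semantics unless the caller supplies an exception.
    FieldSpec specFor(String name, FieldSpec override) {
        return override != null ? override : specFor(name);
    }
}
```

The field number never depends on the override, which is why such
exceptions need not cost anything at merge time in this model.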

> >
> > Also, you mentioned earlier having a field validating query parser.
> > You can already use
> > IndexWriter#getFieldNames(IndexReader.FieldOption.INDEXED) to get all
> > the indexed fields.
>
> At least in Lucene, I believe you mean IndexReader.getFieldNames().

Whoops! Yes I did.

> However, this is not the same thing.  In fact, I submitted bug fixes to
> ParallelReader a while back (now committed) that were in part due to a
> similar assumption.  The issue is that this method only finds fields
> that have already been indexed.  The model may provide fields that no
> document in a specific collection has yet used.  At least in my
> application, this distinction is important.  I have a common model used
> to build many indexes, with search and indexing performed
> simultaneously.  At any point in time in any given collection, a field
> available in the model may or may not have occurred.  Queries need to be
> validated against the model, not against the specific collection.

Sounds to me like you just need to add a validFieldNames collection in
QueryParser. I'm sure you could easily determine which field names are
valid from your common model without having to have a global field
specification within Lucene itself.
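
A minimal sketch of that suggestion, assuming the application can list
the field names in its common model. FieldValidatorSketch and its
methods are hypothetical and deliberately independent of Lucene; a
QueryParser subclass could call check() from whichever hook it
overrides, but that wiring is not shown here.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class FieldValidatorSketch {
    // Field names taken from the application's model, not from what
    // happens to have been indexed in a particular collection so far.
    private final Set<String> validFieldNames;

    FieldValidatorSketch(Collection<String> modelFieldNames) {
        this.validFieldNames =
            Collections.unmodifiableSet(new HashSet<>(modelFieldNames));
    }

    boolean isValid(String field) {
        return validFieldNames.contains(field);
    }

    // Throws if a query refers to a field the model does not define.
    void check(String field) {
        if (!isValid(field)) {
            throw new IllegalArgumentException("Unknown field: " + field);
        }
    }
}
```

Because the set comes from the model, a field that no document has used
yet still validates, which is the distinction raised above.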

Don't get me wrong. I really like the global field spec with
exceptions idea and I personally think it would be an improvement on the
current Lucene model. That's why I've done something similar in
Ferret. But Ferret is still in an alpha stage so I can afford to break
backwards compatibility a little. I just think that it's a lot of work
for too little benefit and it's going to be difficult to stay
backwards compatible.

Cheers,
Dave

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

