lucene-java-user mailing list archives

From Chris Hostetter
Subject Re: Different fields in the same and index and query boosting
Date Sun, 26 Feb 2006 21:05:48 GMT

: Now my concern is: the more projects I add the more different fields
: would come into play. I would not recreate the index from scratch as I'm
: doing right now but I would only remove all documents with e.g. key
: "project1" and add the new documents completely but not touching other
: projects.

this should work ok, but one of the things you'll want to watch out for is
the fieldNorms.

Anytime you add a document with an indexed field, lucene by default creates
a "fieldNorm" value for *every* document in your index -- even if those
documents don't have that field.

if you are only planning on having a handful of document types, and only
a handful of indexed fields per doctype, then this shouldn't be a big
deal -- but if you want to have thousands of indexed fields per doctype,
or thousands of doctypes -- you may find your index size growing a lot
faster than you expected.
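
a rough back-of-the-envelope sketch of that growth, in plain Java with no
lucene dependency. it assumes one norm byte per document for every indexed
field that keeps norms; the class name, method name and document counts are
made up for illustration.

```java
final class NormSizeEstimate {

    // Assumption: every indexed field that keeps norms costs one byte per
    // document in the index, whether or not that document has the field.
    static long normBytes(long numDocs, long fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long docs = 1_000_000L; // hypothetical index size

        // A handful of indexed fields: the norms stay small.
        System.out.println("10 fields:   " + normBytes(docs, 10) + " bytes");

        // Thousands of indexed fields: the norms dominate.
        System.out.println("5000 fields: " + normBytes(docs, 5000) + " bytes");
    }
}
```

under that assumption the cost depends on the number of distinct indexed
fields across the whole index, not on how many fields each document uses.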

there is an option in 1.9 that lets you specify, when adding an indexed
field to a document, that you don't want to bother storing a fieldNorm for
that field -- if you do this, length normalization won't be possible for
queries on that field, and index time field/document boosts won't work --
but if you aren't concerned about those things, it will help keep your
index size manageable.

: Currently I was using query boosting extensively for the headings in HTML
: documents, e.g. title:(term)^8 h1:(term)^7 ... h6:(term)^2
: content:(term)^1 . I was wondering if this is actually necessary. The
: number of existing h1 to h6 fields with content decreases with the
: amount of documents. To give the fields title and h1, which are the most
: used ones anyway, the highest importance, do I need the boost factor
: here anyway or can I avoid them?

you should try some queries like "title:term content:term" and look at the
explain output on your matches to see how much of an impact on the final
score the various matches on title vs content have ... if there are a lot
fewer terms indexed in the title field than in the content field you should
see the match on title be more significant, and then you can decide how
much boost you want to give if it's not significant enough.

my question to you would be: why do you want to avoid using query
time boosts?  there's really no harm in using them, under the covers an
implicit boost of 1.0f is used for every Query class (that i can think of)
so specifying your own boost value doesn't really affect the performance
of the query if that's what you are concerned about.
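
if the boosts are only annoying to type, a small helper can generate the
query string. this is a minimal sketch in plain Java, not a lucene API: the
class and method names are hypothetical, the field weights are the title, h1
and content values from the quoted question, and it assumes the term has no
query-syntax special characters (nothing is escaped). a boost of 1 is left
out of the string since that is the implicit default anyway.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BoostedQueryBuilder {

    // Builds a query string such as "title:(term)^8 h1:(term)^7 content:(term)".
    // Fields appear in the iteration order of the map passed in.
    static String build(String term, Map<String, Integer> boostsByField) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : boostsByField.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey()).append(":(").append(term).append(')');
            int boost = e.getValue();
            // A boost of 1 matches the implicit default, so it is omitted.
            if (boost != 1) {
                sb.append('^').append(boost);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is stable.
        Map<String, Integer> boosts = new LinkedHashMap<>();
        boosts.put("title", 8);
        boosts.put("h1", 7);
        boosts.put("content", 1);
        System.out.println(build("lucene", boosts));
    }
}
```

keeping the weights in one map also makes it easy to adjust them after
looking at the explain output for a few representative queries.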
