lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: product based term combination for BooleanQuery?
Date Wed, 04 Jul 2007 19:09:31 GMT
It looks like you may already be aware of this, but if not, this is
something Chris H. posted quite a while ago that I found useful....

<<<it depends on your goal.  index time field boosts are a way to express
things like "this documents title is worth twice as much as the title of
most documents" query time boosts are a way to express "i care about
matches on this clause of my query twice as much as i do about matches to
other clauses of my query.>>>

Erick

On 7/4/07, Tim Sturge <tsturge@metaweb.com> wrote:
>
>
> :-) The use of wikipedia data here is no secret; it's all over
> www.freebase.com. I just hoped to avoid being sucked into a "what is the
> best way to index wikipedia with Lucene?" discussion, which I believe
> several other groups are already tackling.
>
> At index time, I used a per document boost (over all fields) and a per
> field bost (over all documents). I can certainly factor out the first into a
> query boost, but I was under the impression that if I ever wanted to combine
> fields (eg to index all "name" "alias" and "title" data in a single "head"
> field) then I had to pre-boost the data prior to combining it. I tend to
> believe that these (short) fields contain more relevant information than
> (long) wikipedia articles or other documents.
>
> Should idf and tf take care of that short/long quality distinction? It
> sounds like you feel they should.
> I'll build an index without the per field boost and see if that produces
> improved results.
>
> Thanks,
>
> Tim
>
> ----- Original Message -----
> From: "Chris Hostetter" <hossman_lucene@fucit.org>
> To: "Lucene Users" <java-user@lucene.apache.org>
> Sent: Tuesday, July 3, 2007 10:26:57 PM (GMT-0800) America/Los_Angeles
> Subject: Re: product based term combination for BooleanQuery?
>
>
> (side note: if you are going to try and obfuscate your field names when
> sending explain output so we don't know you are using wikipedia data (not
> that we care), please at least be consistent about it so the final
> explanations actual make sense -- it will save everyone a lot of confusion
> and help us help you)
>
> the biggest factor in your scores seems to be the fieldNorms for your
> name, title and alias fields ... they are so high, that tf and idf are
> pretty much irrelevant.
>
> By the looks of it, when you were indexing your docs, you used a
> consistent field boost per field on every instance of that field for every
> document ... this is really not a use case where index time field (or
> document) boosts make sense.  in my opinion hte number one thing you can
> do to imrpove your relevency right now is to stop using index time
> boosts and use query boosts instead.
>
> If you don't want to reindex completely the LengthNormModifier class (in
> the misc contrib) can update all of your norms in place without reindexing
> and throw away any index time boosts you had.
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message