lucene-java-user mailing list archives

From Joel Halbert <j...@su3analytics.com>
Subject Re: scoring adjacent terms without proximity search
Date Sat, 31 Oct 2009 08:38:29 GMT
Thank you all for your suggestions. I shall have a little think about
the best way forward, and report back if I do anything interesting that
works well.

In answer to Grant's question (why not use PhraseQuery?): we do not want
an artificial upper limit on the slop, and we do want to include
documents that contain only a subset of the words from the phrase
(e.g. just cheese, or just sandwich, but not both).
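
To make that requirement concrete, here is a toy sketch in plain Java. It
is not Lucene's API or its scoring formula; the class name, the method
name and the bonus value are all made up for illustration. Every query
term is optional, and a document earns an extra bonus when two
neighbouring query terms are also neighbours in the document.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration only: not Lucene's API and not its scoring formula.
class AdjacencyScoreSketch {

    // Adds 1 for each query term present in the document (every term is
    // optional), then adds a bonus for each adjacent query pair that also
    // appears adjacent, in the same order, in the document.
    static double score(List<String> query, List<String> doc, double adjacencyBonus) {
        double score = 0.0;
        for (String term : query) {
            if (doc.contains(term)) {
                score += 1.0;
            }
        }
        for (int q = 0; q + 1 < query.size(); q++) {
            for (int d = 0; d + 1 < doc.size(); d++) {
                if (doc.get(d).equals(query.get(q))
                        && doc.get(d + 1).equals(query.get(q + 1))) {
                    score += adjacencyBonus;
                    break; // count each query pair at most once
                }
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("cheese", "sandwich");
        // Terms adjacent: both terms match and the pair earns the bonus.
        System.out.println(score(query, Arrays.asList("a", "cheese", "sandwich"), 2.0));
        // Both terms present but apart: no bonus, still a match.
        System.out.println(score(query, Arrays.asList("sandwich", "with", "cheese"), 2.0));
        // Only a subset of the phrase: still included, lowest score.
        System.out.println(score(query, Arrays.asList("blue", "cheese"), 2.0));
    }
}
```

In this sketch no distance limit is involved at all: a document either
gets the adjacency bonus or it does not, and documents with only one of
the words still match.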


-----Original Message-----
From: Robert Muir <rcmuir@gmail.com>
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: scoring adjacent terms without proximity search
Date: Fri, 30 Oct 2009 16:04:03 -0400

> I suppose you could precompute the proximity associations by indexing
> n-grams (Lucene calls them shingles), such that there
> is a single token in your index containing cheese_sandwich (effectively)
>
>
Doh, I see Grant already led you in this direction (sorry for the
duplicate mail).
On average it's worked for me for some things like this.

Although, I'll try to contribute something actually useful and mention that
if you use things like shingles, it's good to consider modifying
DefaultSimilarity; look at the setDiscountOverlaps param.
Otherwise, I've measured cases where injecting additional tokens causes
more harm than good, because it has an adverse effect on lengthNorm.
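
A rough illustration of the length-norm point, in plain Java rather than
Lucene calls. It assumes two things that are not stated above: that the
norm has the form 1/sqrt(number of counted tokens), and that the injected
shingles are overlap tokens (position increment 0), which is what
discounting overlaps would leave out of the count. The class and method
names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch: it mimics the idea but does not call Lucene.
class ShingleNormSketch {

    // Joins each adjacent pair of tokens into one bigram "shingle",
    // e.g. [cheese, sandwich] -> [cheese_sandwich].
    static List<String> bigramShingles(List<String> tokens) {
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            shingles.add(tokens.get(i) + "_" + tokens.get(i + 1));
        }
        return shingles;
    }

    // Assumed norm: 1/sqrt(counted tokens). When discountOverlaps is true,
    // the overlap tokens (here, the shingles) are not counted.
    static double lengthNorm(int numTokens, int numOverlaps, boolean discountOverlaps) {
        int counted = discountOverlaps ? numTokens - numOverlaps : numTokens;
        return 1.0 / Math.sqrt(counted);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("grilled", "cheese", "sandwich", "recipe");
        List<String> shingles = bigramShingles(words);
        int total = words.size() + shingles.size(); // 4 words + 3 shingles

        System.out.println(shingles);
        // Shingles counted: the field looks longer, so the norm drops.
        System.out.println(lengthNorm(total, shingles.size(), false));
        // Shingles discounted: the norm matches the unshingled field.
        System.out.println(lengthNorm(total, shingles.size(), true));
    }
}
```

Under these assumptions a four-word field grows to seven tokens once
shingles are injected, so its norm falls from 1/sqrt(4) to 1/sqrt(7)
unless the overlaps are discounted.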



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

