lucene-dev mailing list archives

From Ype Kingma <ykin...@xs4all.nl>
Subject Re: sloppyFreq - why on Similarity?
Date Mon, 22 Sep 2003 04:23:31 GMT
On Monday 22 September 2003 11:15, Erik Hatcher wrote:
> On Sunday, September 21, 2003, at 03:00  PM, ykingma@xs4all.nl wrote:
>
> > That surprises me. I would have expected that sloppyFreq() would also
> > be called for fuzzy terms. In both cases there is a distance
> > that influences the effective term frequency.
>
> WildcardQuery and FuzzyQuery do have the capability of affecting the
> scoring, although only FuzzyQuery seems to take advantage of this.
> There is no way on a Similarity implementation to affect the factors
> applied by these queries though.  So there is some inconsistency on
> these types of things.
>
> If I'm wrong about WildcardQuery, let me know, but I don't see that
> searching for "luc*" gives higher weight to "luck" than "lucene",
> although it seems that it should.  (it uses a 1.0 multiplier hardcoded
> for the boost factor of the rewritten TermQuery).
>
> > Actually I would prefer to have two different scoring
> > methods for sloppy phrases and fuzzy terms.
>
> And a different one for wildcard queries?  It seems, at least to my

There is already the idf() term weight for the matching terms.
However, I'd like to have the possibility to have the same weight
for each term matching a wildcard: one use of a wildcard is
that you don't care which term matches, but you do consider
every possible match equally important.
AFAIK such equal weighting is not possible with the current
Similarity interface.
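
A minimal sketch of the difference between idf()-based and equal weighting for the terms a wildcard expands to. The class name, the termWeights() helper, and the document frequencies are made up for illustration and are not part of Lucene; the idf formula log(numDocs / (docFreq + 1)) + 1 is assumed to match the default implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WildcardWeightSketch {

    // Assumed default-style idf: rarer terms get a larger weight.
    public static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Hypothetical helper: one weight per expanded term,
    // either idf-based or the same constant for every term.
    public static Map<String, Double> termWeights(
            Map<String, Integer> docFreqs, int numDocs, boolean equalWeights) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            double w = equalWeights ? 1.0 : idf(e.getValue(), numDocs);
            weights.put(e.getKey(), w);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Invented document frequencies for terms matching "luc*".
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("lucene", 500);
        docFreqs.put("luck", 20);
        docFreqs.put("lucnee", 1); // rare misspelling

        System.out.println("idf weights:   " + termWeights(docFreqs, 1000, false));
        System.out.println("equal weights: " + termWeights(docFreqs, 1000, true));
    }
}
```

With idf weights the rare misspelling gets the largest weight of the three; with equal weights every matching term counts the same.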

> newbie mindset, that Similarity is carrying around too much, although
> it is a one-stop place (or seems to sell itself that way) for all score
> related tweaks.  But there are exceptions like MultiTermQuery
> subclasses like WildcardQuery and FuzzyQuery.
>
> Is there a need to unify these types of tweaks into Similarity?

The nice thing about an interface is that it allows you to fall
back to a default. In that sense unification is OK, I think.

> > The Similarity interface is used for determining how similar a document
> > is to a query. I think sloppyFreq() is well placed there, given
> > the current default implementation that 'works back' from (sloppy)
> > phrase frequencies (and fuzzy term frequencies?) to normal term
> > frequencies.
>
> So now we need a getFuzzyFreq and getWildcardFreq?!  :)

Thinking about it, yes. Something like fuzzyFreq(int distance)
would make sense.
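
A rough sketch of what a separate fuzzyFreq(int distance) might look like next to sloppyFreq(int distance). The class is a standalone stand-in, not Lucene's Similarity; sloppyFreq() is assumed to follow the 1 / (distance + 1) shape of the default implementation, and the steeper decay chosen for fuzzyFreq() is an arbitrary assumption that only shows the two could differ.

```java
public class FuzzyFreqSketch {

    // Assumed default-style sloppy phrase frequency: 1 / (distance + 1).
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // Hypothetical fuzzyFreq: penalizes edit distance more steeply.
    // The quadratic decay is an illustrative choice, not a proposal from the thread.
    public static float fuzzyFreq(int distance) {
        if (distance < 0) {
            throw new IllegalArgumentException("distance must be >= 0");
        }
        return 1.0f / (1 + distance * distance);
    }

    public static void main(String[] args) {
        for (int d = 0; d <= 3; d++) {
            System.out.println("distance " + d
                    + ": sloppyFreq=" + sloppyFreq(d)
                    + " fuzzyFreq=" + fuzzyFreq(d));
        }
    }
}
```

An exact match (distance 0) keeps a frequency of 1 in both methods; only the rate at which larger distances are discounted differs.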
For the wildcards there is a problem, though. Currently they work
just like an OR query with idf() weights. However, I'd like the
possibility to have the same weights for OR'ed terms, too,
just like for the wildcards I mentioned above.
It seems that this would require adding an operator to the query
language, e.g. WOR (Weighted OR), which would give
all OR'ed terms the same weight.
A first step could be wildcards with equal weights;
this would at least cover the case of very low frequency
spelling errors, which currently get too high an
idf() weight.

A rather drastic way to accomplish this would be
to have idf() always return 1, leaving all the
weighting to the query weights.
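
A sketch of the constant-idf idea, assuming a simplified model in which a term's weight is just its query boost times idf(). BaseSimilarity, ConstantIdfSimilarity and termWeight() are hypothetical stand-ins written for this example; they are not Lucene classes, and real scoring involves more factors.

```java
public class ConstantIdfSketch {

    // Hypothetical stand-in for a similarity with a default-style idf().
    public static class BaseSimilarity {
        public float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }
    }

    // The "drastic" variant: idf() ignores its arguments and returns 1.
    public static class ConstantIdfSimilarity extends BaseSimilarity {
        @Override
        public float idf(int docFreq, int numDocs) {
            return 1.0f;
        }
    }

    // Simplified term weight: query boost times idf (an assumption of this sketch).
    public static float termWeight(BaseSimilarity sim, float queryBoost,
                                   int docFreq, int numDocs) {
        return queryBoost * sim.idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        BaseSimilarity constant = new ConstantIdfSimilarity();
        // Rare and common terms now differ only by their query boost.
        System.out.println(termWeight(constant, 2.0f, 1, 1000));
        System.out.println(termWeight(constant, 2.0f, 500, 1000));
    }
}
```

With the constant variant, a rare term and a common term with the same boost end up with the same weight, so all differentiation has to come from the query.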


