lucene-dev mailing list archives

From "D.L.B." <augustea...@rcn.com>
Subject Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
Date Thu, 13 Feb 2003 15:37:44 GMT
On Thursday 13 February 2003 07:49 am, Christoph Kiehl wrote:
> Tatu Saloranta wrote:
> > - Stemming can only be done for prefix queries (what is stem of,
> >   say, "hä*er"?), and even then it might not produce stem one would
> >   want. For example, for prefix query "men*" might be 'stemmed' to
> >   "man*", and user might be perplexed at why documents with
> >   words like "meningitis" and "menstrual" did not match (ok, that is
> >   a contrived example, but hope you get the idea).
>
> Good point. It is really amazing how different and complex languages are
> ;)

Given that this is the case, I don't think it's possible to come up with a 
solution that will cover every case.  That said, I believe it is still 
worthwhile to try to do something reasonable to cover most cases.

The company I work for has public text searchable websites in the following 
languages: English, Danish, Spanish, French, Dutch, Norwegian, Finnish, and 
Swedish.  The approach we took, as I mentioned in an earlier mail, was to 
only stem prefix and "suffix" queries (of the form *someText).  In these 
cases, we strip the wildcard character before passing the term to the 
stemmer, and we only use the stemmed result if it is a single word.
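
A minimal sketch of that policy in Java, to make the steps concrete. The
class name WildcardStemPolicy is hypothetical, and the stemmer is a plain
function standing in for whatever language-specific stemmer is in use;
neither is part of Lucene. Falling back to the unstemmed term when the
stemmer returns nothing, or more than one word, is an assumption about
what "only use the stemmed result" implies.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper illustrating the wildcard policy; not a Lucene class. */
final class WildcardStemPolicy {
    // Stand-in for a language-specific stemmer (assumption: String -> String).
    private final UnaryOperator<String> stemmer;

    WildcardStemPolicy(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    /** Stems pure prefix (text*) and suffix (*text) terms; leaves all others alone. */
    String rewrite(String term) {
        boolean prefix = term.endsWith("*");
        boolean suffix = term.startsWith("*");
        if (prefix == suffix) {
            return term; // no edge wildcard, or wildcards on both edges
        }
        // Strip the wildcard so the stemmer never sees it.
        String text = prefix ? term.substring(0, term.length() - 1) : term.substring(1);
        if (text.isEmpty() || text.indexOf('*') >= 0 || text.indexOf('?') >= 0) {
            return term; // inner wildcards such as "h*er" are not stemmed
        }
        String stemmed = stemmer.apply(text);
        if (stemmed == null) {
            return term;
        }
        stemmed = stemmed.trim();
        boolean singleWord = !stemmed.isEmpty()
                && stemmed.chars().noneMatch(Character::isWhitespace);
        if (!singleWord) {
            return term; // assumed fallback: keep the original term
        }
        // Put the wildcard back on the side it came from.
        return prefix ? stemmed + "*" : "*" + stemmed;
    }
}
```

With a toy stemmer that only strips a trailing "s", rewrite("cars*") gives
"car*" and rewrite("*cars") gives "*car", while "c*rs" passes through
unchanged.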

We didn't have time to analyze all the stemming possibilities of each language 
and how our wildcard policy might perform in all cases.  Instead, we just 
threw it out there and had the native speakers run their QA and see what 
happened.

It turns out that this wildcard policy works well for us -- the users tend to 
get the results they expect.  Whatever solution comes out of this 
discussion, I just wanted to mention what is working for us.  I'm thinking 
that three changes might be enough to please a lot of users: adding a suffix 
term notion to QueryParser.jj, parallel to the existing prefix term; 
creating subclassable methods to handle both; and maybe providing a subclass 
that performs the imperfect stemming solution mentioned above.
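
One way the subclassable-method idea could look, as a rough sketch only.
SketchQueryParser stands in for the parser generated from QueryParser.jj,
and the hook names handlePrefixTerm and handleSuffixTerm are hypothetical,
not existing Lucene methods. To stay self-contained, the hooks return the
rewritten term as a String rather than building a query object.

```java
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for the generated parser; not Lucene's QueryParser. */
class SketchQueryParser {
    /** Routes a raw term to the matching hook, as the grammar actions might. */
    final String parseTerm(String term) {
        int n = term.length();
        if (n > 1 && term.endsWith("*") && isPlain(term.substring(0, n - 1))) {
            return handlePrefixTerm(term.substring(0, n - 1));
        }
        if (n > 1 && term.startsWith("*") && isPlain(term.substring(1))) {
            return handleSuffixTerm(term.substring(1));
        }
        return term; // plain terms and other wildcard shapes are left as-is
    }

    private static boolean isPlain(String s) {
        return s.indexOf('*') < 0 && s.indexOf('?') < 0;
    }

    /** Default hooks: no stemming, just reattach the wildcard. */
    protected String handlePrefixTerm(String text) {
        return text + "*";
    }

    protected String handleSuffixTerm(String text) {
        return "*" + text;
    }
}

/** Hypothetical subclass applying the imperfect stemming policy. */
class StemmingSketchQueryParser extends SketchQueryParser {
    private final UnaryOperator<String> stemmer; // stand-in for a real stemmer

    StemmingSketchQueryParser(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    @Override
    protected String handlePrefixTerm(String text) {
        return stemIfSingleWord(text) + "*";
    }

    @Override
    protected String handleSuffixTerm(String text) {
        return "*" + stemIfSingleWord(text);
    }

    // Uses the stem only when it is a single word; otherwise keeps the input.
    private String stemIfSingleWord(String text) {
        String s = stemmer.apply(text);
        if (s == null) {
            return text;
        }
        s = s.trim();
        boolean single = !s.isEmpty() && s.chars().noneMatch(Character::isWhitespace);
        return single ? s : text;
    }
}
```

Keeping the default hooks free of stemming means existing behavior would
not change unless an application opts into the subclass.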

DaveB
