lucene-solr-user mailing list archives

From Grant Ingersoll
Subject Re: Advice on analysis/filtering?
Date Thu, 16 Oct 2008 14:21:20 GMT

On Oct 16, 2008, at 3:07 AM, Jarek Zgoda wrote:

> Hello, group.
> I'm trying to create a search facility for documents in "broken"  
> Polish (by broken I mean "not language rules compliant"),

Can you explain what you mean here a bit more?  I don't know Polish,
but most spoken languages can't be pinned down to a specific set of
rules.  In other words, the exception is the rule.  Or are you saying
the documents use more dialog-based, i.e. more informal, language, as
when two people are having a conversation?

> searchable by terms in "broken" Polish, but broken in different
> ways than the documents are. See this example:
> document text: "włatcy móch" (in proper Polish this would be  
> "władcy much")
> example terms that should match: "włatcy much", "wlatcy moch",  
> "wladcy much"
> This double brokenness ruled out any Polish stemmers currently
> available for Lucene and now I am at point 0. The search results do  
> not have to be 100% accurate - some missing results are acceptable,
> but "false positives" are not.

There's no such thing as zero false positives in any language.  In
your example above, what is matching that shouldn't?  Is this
happening across a lot of documents, or just a few?

> Is it at all possible using machinery provided by Solr (I do not have
> a PhD in linguistics), or should I ask the business to lower their
> expectations?

Well, I think there are a couple of approaches:
1. You can write your own filter/stemmer/analyzer that you think fixes  
these issues
2. You can protect the "broken" words and not have them filtered, or  
filter them differently.
3. You can lower expectations.
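
As a rough illustration of approach 1, here is a minimal sketch in Python (not Lucene code) of the kind of character folding a custom filter could apply to every token at both index and query time. The rule table and the names FOLD, fold_token and fold_text are assumptions made up for this example, derived only from the "włatcy móch" case quoted above; a real rule set would need input from someone who knows Polish.

```python
# Hypothetical folding table: every rule here is an assumption taken from the
# single "włatcy móch" example, not a vetted description of Polish spelling.
FOLD = str.maketrans({
    # strip Polish diacritics
    "ł": "l", "ą": "a", "ę": "e", "ć": "c",
    "ń": "n", "ś": "s", "ź": "z", "ż": "z",
    # collapse ó / u / o into one vowel ("móch", "much", "moch")
    "ó": "o", "u": "o",
    # collapse the d / t pair ("władcy", "włatcy")
    "d": "t",
})


def fold_token(token):
    """Lowercase a token and map each character through FOLD (single pass)."""
    return token.lower().translate(FOLD)


def fold_text(text):
    """Split on whitespace and fold every token."""
    return [fold_token(t) for t in text.split()]


if __name__ == "__main__":
    document = "włatcy móch"
    queries = ["władcy much", "włatcy much", "wlatcy moch", "wladcy much"]
    doc_key = fold_text(document)
    for q in queries:
        # a query matches when its folded tokens equal the folded document tokens
        print(q, fold_text(q) == doc_key)
```

Folding this aggressively makes all of the variants in the example compare equal, but it also conflates any unrelated words that differ only in those letters, so it buys recall at the cost of some false positives.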

One thing to try out is the analysis tool in Solr's admin interface, to
see if you can get a better handle on what is going wrong.

Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
