lucene-solr-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Advice on analysis/filtering?
Date Thu, 16 Oct 2008 13:54:33 GMT
Well, let me see. Your customers are telling you, in essence,
"for any random input, you cannot return false positives". Which
is nonsense, so I'd say you need to negotiate with your
customers. I flat guarantee that, for any algorithm you try,
you can write a counter-example in, oh, 15 seconds or so <G>.

I think the best you can hope for is "reasonable results", but
getting your customers to agree to what is "reasonable" is...er...
often a challenge. Frequently when confronted by "close but
not perfect", customers aren't as unforgiving as their first
position would indicate since the inconvenience of the not-
quite-perfect results is often much less than people think
when starting out.

Fuzzy search (FuzzyQuery in Lucene) tries to do some of this work
for you, and since this is a common issue, its results may be
acceptable. But it'll never be perfect.
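
To make the "never perfect" point concrete, here is a minimal sketch of the edit-distance idea that fuzzy matching is generally built on. It is plain Python written for illustration, not Lucene's code; the function name edit_distance is an assumption, and the sample strings are taken from the question quoted below (plus "mech", which is added here as a counter-example).

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    # Normalize so that accented letters compare as single characters.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    for term in ("władcy", "wlatcy", "wladcy"):
        print("włatcy", term, edit_distance("włatcy", term))
    for term in ("much", "moch", "mech"):
        print("móch", term, edit_distance("móch", term))
```

With these inputs, "wladcy" is two edits away from "włatcy", so the threshold has to allow at least two edits to catch it. At the same time "mech", a different word, is exactly as close to "móch" as "much" is. Any threshold loose enough to catch the wanted variants also lets in some unwanted ones.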

You might get some joy from ngrams, but I haven't
worked with them myself, just seen them recommended by people
whose opinions I respect...
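
For what it's worth, the ngram idea can be sketched roughly like this: break each string into overlapping character pairs and compare the sets. This is an illustration in plain Python under my own assumptions (bigrams, Jaccard overlap, the helper names char_ngrams and ngram_similarity), not a description of how Solr's ngram analysis scores matches.

```python
import unicodedata


def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of overlapping character n-grams in text."""
    text = unicodedata.normalize("NFC", text.lower())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard overlap of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    for term in ("wlatcy", "władcy", "kot"):
        print("włatcy", term, ngram_similarity("włatcy", term))
```

Misspelled variants share most of their character pairs with the indexed text and score well above unrelated strings, but where to put the cut-off is still a judgment call, so the false-positive problem does not go away.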

Best
Erick


2008/10/16 Jarek Zgoda <jarek.zgoda@redefine.pl>

> Hello, group.
>
> I'm trying to create a search facility for documents in "broken" Polish (by
> broken I mean "not language rules compliant"), searchable by terms in
> "broken" Polish, but broken in different ways than the documents. See this
> example:
>
> document text: "włatcy móch" (in proper Polish this would be "władcy much")
> example terms that should match: "włatcy much", "wlatcy moch", "wladcy
> much"
>
> This double brokenness ruled out any Polish stemmers currently available for
> Lucene and now I am at point 0. The search results do not have to be 100%
> accurate - some missing results are acceptable, but "false positives" are
> not. Is it at all possible using machinery provided by Solr (I do not have
> a PhD in linguistics), or should I ask the business to lower their
> expectations?
>
> --
> We read Knuth so you don't have to. - Tim Peters
>
> Jarek Zgoda, R&D, Redefine
> jarek.zgoda@redefine.pl
>
>