lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Check my thinking on this, wildcard matching in phrases.
Date Fri, 14 Mar 2014 13:35:54 GMT
Ahmet:

I saw your patch updating to 4.7. I have a long plane ride this
afternoon that I hope to use to look at it more closely. Thanks for
updating it!

And thanks for your comment on putting the $ in the full token, I
hadn't thought of that, but I think you're absolutely right.

Thanks....

On Fri, Mar 14, 2014 at 4:50 AM, Ahmet Arslan <iorixxx@yahoo.com> wrote:
> Hi Erick,
>
> I think it's a very good idea.
>
> What happens when you search "my$ dog$"? I think it does not retrieve your example document.
> Since * means zero or more chars, I wonder whether that would be the expected behaviour.
>
> If you inject the last token both with and without $, would that harm anything? For example: d$ do$ dog$ dog
>
> Erick, what do you think about LUCENE-5205? It is a replacement candidate for Surround
> and ComplexPhrase. It has none of their weaknesses, and its author, Tim Allison, responds very
> quickly to any comments, questions, improvements, bugs, etc. By the way, SOLR-5410 is the wrapper
> for LUCENE-5205.
>
> Ahmet
>
>
>
> On Friday, March 14, 2014 3:38 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> or "why haven't I thought of this before"?
>
> I'm once again being faced with the recurring problem of phrase
> searches with wildcards. It'll lead to index bloat, but that's
> acceptable in this situation, at least until proved not so.
>
> The surround query parser can deal with wildcards and proximity, but
> it doesn't accept anything less than three leading characters, which
> is another problem in this case.
>
> I know the complex phrase query parser is out there, but it's not part
> of the code base.
>
> So I'm thinking of modifying the EdgeNGramFilter, I've coded up a
> prototype that seems to work. Basically, it just appends $ to all the
> grams _except_ the last one. I set maxGramSize to 1000, so we'll
> assume the final gram is the original term.
>
> So, indexing "my dog has fleas" I get
> pos 1   pos 2   pos 3   pos 4
> m$      d$      h$      f$
> my      do$     ha$     fl$
>         dog     has     fle$
>                         flea$
>                         fleas
>
>
> Now, when users want to search for "m* fleas" within 5 words, they can
> search for:
> "m$ fleas"~5
> or
> "m$ fle$"~5
> or even
> "m$ do$ fle$"~3
>
>
> and they won't get false matches on something like
> "do ha"
>
> You have to accept some simplifications here, of course. This doesn't
> handle things like "fle*s" and the like.
>
> I'm also not sure this is general-purpose enough to make an option for
> EdgeNGramFilterFactory, the use-case is somewhat restricted. But
> that's a relatively natural fit, a new param like
> 'subGramAppendChar="$"'
>
> Thoughts?
>
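
For readers following the thread, the gram scheme in the quoted message can be sketched in a few lines of plain Python. This is only an illustration of the token layout, not the prototype filter or any Lucene API: the function names and the mark_full_term flag are hypothetical, and the flag models Ahmet's suggestion of injecting the full token both with and without the $.

```python
def marked_edge_ngrams(term, marker="$", min_gram=1, mark_full_term=False):
    """Return the edge n-grams of `term`, shortest first.

    Every proper prefix gets `marker` appended; the full term is emitted
    unmarked. If `mark_full_term` is true (hypothetical option modelling
    the suggestion in the reply), the full term is also emitted with the
    marker, just before the unmarked form.
    """
    grams = [term[:n] + marker for n in range(min_gram, len(term))]
    if mark_full_term:
        grams.append(term + marker)
    grams.append(term)
    return grams


def index_tokens(text, **kwargs):
    """Map each 1-based word position to the grams stacked at that position."""
    return {
        pos: marked_edge_ngrams(word, **kwargs)
        for pos, word in enumerate(text.split(), start=1)
    }


if __name__ == "__main__":
    # Whitespace tokenization is an assumption made to keep the sketch small.
    for pos, grams in index_tokens("my dog has fleas").items():
        print(pos, " ".join(grams))
```

With mark_full_term=True, the term "dog" yields d$ do$ dog$ dog, so a query such as "my$ dog$" would also find the full words at the cost of one extra token per position.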
