lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: wildcard and span queries
Date Mon, 09 Oct 2006 23:39:46 GMT
Doron:

Thanks for the suggestion, I'll certainly put it on my list, depending upon
what the PM decides. This app is geneaology reasearch, and users *can* put
in their own wildcards...

This is why I love this list... lots of smart people giving me suggestions I
never would have thought of <G>...

Thanks
Erick

On 10/9/06, Doron Cohen <DORONC@il.ibm.com> wrote:
>
> "Erick Erickson" <erickerickson@gmail.com> wrote on 09/10/2006 13:09:21:
> > ... The kicker is that what we are indexing is
> > OCR data, some of which is pretty trashy. So you wind up with
> "interesting"
> > words in your index, things like rtyHrS. So the whole question of
> allowing
> > very specific queries on detailed wildcards (combined with spans) is
> under
> > discussion. It's not at all clear to me that there's any value to the
> end
> > users in the capability of, say, two character prefixes. And, it's an
> easy
> > rule that "prefix queries must specify at least 3 non-wildcard
> > characters"....
>
> Erick, I may be out of course here, but, fwiw, have you considered n-gram
> indexing/search for a degree of fuzziness to compensate for OCR errors..?
>
> For a four words query you would probably get ~20 tokens (bigrams?) - no
> matter what the index size is. You would then probably want to score
> higher
> by LA (lexical affinity - query terms appear close to each other in the
> document) - and I am not sure to what degree a span query (made of n-gram
> terms) would serve that, because (1) all terms in the span need to be
> there
> (well, I think:-); and, (2) you would like to increase doc score for
> close-by terms only for close-by query n-grams.
>
> So there might not be a ready to use solution in Lucene for this, but
> perhaps this is a more robust direction to try than the wild card approach
>
> - I mean, if users want to type a wild card query, it is their right to do
> so, but for an application logic this does not seem the best choice.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message