lucene-java-user mailing list archives

From 김한규 <gksr...@gmail.com>
Subject Re: NGramPhraseQuery with missing terms
Date Thu, 20 Dec 2012 10:00:31 GMT
Thanks for the reply.

I actually solved the issue by overriding the setFreqCurrentDoc() method of
SpanScorer to give a boost (by adding extra frequency) when span positions
are found within a chosen distance of one another. I also had to override
SpanQuery and SpanWeight, just so that they accept multiple spans instead of
one. Overriding three classes does not feel like the neatest solution, so
I'll gladly accept any advice or a tutorial on writing a custom scorer.

As for the discussion: the problem was not with BooleanQuery but with
SpanNearQuery, which can only compare the distance between two terms at a
time. So even when there are multiple terms, I have to pair them up two at a
time to find out whether they are near each other. PhraseQuery doesn't work
here because it discards a document right away once a single term is not
found in it.
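
To make the cost of pairing concrete, here is a small plain-Java sketch (no
Lucene calls; the class and method names are made up for illustration) that
enumerates every unordered pair of NGrams. Under the approach described
above, each pair would need its own two-term near query, so the count grows
as n(n-1)/2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PairwiseNear {
    /** Hypothetical helper: returns every unordered pair of n-grams. */
    static List<String[]> pairs(List<String> grams) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < grams.size(); i++) {
            for (int j = i + 1; j < grams.size(); j++) {
                // Each pair stands for one two-term proximity check.
                out.add(new String[] { grams.get(i), grams.get(j) });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Bigrams of "lucene", used only as sample data.
        List<String> grams = Arrays.asList("lu", "uc", "ce", "en", "ne");
        System.out.println(pairs(grams).size()); // 5 * 4 / 2 = 10
    }
}
```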

I thought it would be a better implementation to ignore the term context
and compare only the positions. My reasoning: the same NGram is unlikely to
appear repeatedly. So if the positions of any of the matching NGrams are
very close to each other, they probably belong to the matching word/phrase.
Even if they are in the wrong order, it's a fuzzy match. So as long as
NGrams appear near each other, I should boost the document's score anyway.
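
A rough plain-Java sketch of that idea follows. It is not the actual
SpanScorer override; the class name, method name, and the maxGap and boost
parameters are assumptions made for illustration. It merges the positions of
all matching NGrams regardless of which NGram they came from, then adds
extra frequency for each neighbouring pair of positions that lies within
the chosen distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ProximityBoost {
    /**
     * Hypothetical helper: starts from one unit of frequency per matched
     * position and adds `boost` for every adjacent pair of positions that
     * is at most `maxGap` apart. Term identity and order are ignored.
     */
    static float boostedFreq(List<int[]> positionsPerGram, int maxGap, float boost) {
        List<Integer> all = new ArrayList<>();
        for (int[] positions : positionsPerGram) {
            for (int p : positions) {
                all.add(p);
            }
        }
        Collections.sort(all);

        float freq = all.size();
        for (int i = 1; i < all.size(); i++) {
            if (all.get(i) - all.get(i - 1) <= maxGap) {
                freq += boost;
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // One NGram (at position 12) is missing, yet the rest still cluster.
        List<int[]> clustered = Arrays.asList(new int[] {10}, new int[] {11}, new int[] {13});
        List<int[]> scattered = Arrays.asList(new int[] {10}, new int[] {50}, new int[] {90});
        System.out.println(boostedFreq(clustered, 2, 1.0f)); // 5.0
        System.out.println(boostedFreq(scattered, 2, 1.0f)); // 3.0
    }
}
```

With these sample positions the clustered document ends up with a higher
frequency than the scattered one, even though both match three NGrams.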


2012/12/19 Jack Krupansky <jack@basetechnology.com>

> "a BooleanQuery, but it requires me to consider every possible pair of
> terms (since any one of the terms could be missing)"
>
> What about setting minMatch and all the terms as "SHOULD" - and then
> minMatch could be tuned for how many missing terms to tolerate?
>
> See:
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/BooleanQuery.html#setMinimumNumberShouldMatch(int)
>
> -- Jack Krupansky
>
> -----Original Message----- From: 김한규
> Sent: Wednesday, December 19, 2012 2:36 AM
> To: java-user@lucene.apache.org
> Subject: NGramPhraseQuery with missing terms
>
>
> Hi.
>
> I am trying to make an NGramPhrase query that can tolerate missing terms,
> so that even if one of the NGrams doesn't match, the document still gets
> picked up by the search.
> I know I could use a combination of a normal SpanNearQuery and a
> BooleanQuery, but it requires me to consider every possible pair of terms
> (since any one of the terms could be missing), and it gets too messy and
> expensive.
>
> What I want to try is to use SpanTermQuery to get the positions of the
> matching NGrams and list the spans' position information in order, so
> that I can pick out any two or more spans near each other and score them
> accordingly, but I can't figure out how to combine the spans.
>
> Any help in solving this issue is appreciated. Also, if there is an
> example of a simple scoring implementation that combines multiple queries'
> results, it would be very nice.
>
