lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-5815) Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery
Date Fri, 11 Jul 2014 11:28:05 GMT
Michael McCandless created LUCENE-5815:
------------------------------------------

             Summary: Add TermAutomatonQuery, for proximity matching that generalizes MultiPhraseQuery/SpanNearQuery
                 Key: LUCENE-5815
                 URL: https://issues.apache.org/jira/browse/LUCENE-5815
             Project: Lucene - Core
          Issue Type: New Feature
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 4.10


I created a new query, called TermAutomatonQuery, that's a proximity
query to generalize MultiPhraseQuery/SpanNearQuery: it lets you
construct an arbitrary automaton whose transitions are whole terms, and
then find all documents that the automaton matches.  This is different
from a "normal" automaton, whose transitions are usually
bytes or characters within a single term.

So, if the automaton has just 1 transition, it's just an expensive
TermQuery.  If you have two transitions in sequence, it's a phrase
query of two terms.  You can express synonyms by using transitions
that overlap one another, and the automaton doesn't have to be a
"sausage" (a linear chain of positions, each holding one or more
alternative terms, as MultiPhraseQuery requires), i.e. it "respects"
posLength (at query time).
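
As a rough illustration of the idea (a toy model using only the JDK,
not the query's actual API), the sketch below treats whole terms as
transition labels and checks whether any contiguous run of document
tokens drives the automaton from state 0 to the accept state.  The
class name TermAutomatonSketch, its methods, and the "dns server"
example are all assumptions made up for this sketch, and it assumes
exactly one token per document position.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: hypothetical names, NOT Lucene's TermAutomatonQuery API. */
class TermAutomatonSketch {
    /** Consuming {@code term} moves the automaton from {@code from} to {@code to}. */
    private static final class Transition {
        final int from;
        final String term;
        final int to;

        Transition(int from, String term, int to) {
            this.from = from;
            this.term = term;
            this.to = to;
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private final Set<Integer> acceptStates = new HashSet<>();

    public void addTransition(int from, String term, int to) {
        transitions.add(new Transition(from, term, to));
    }

    public void setAccept(int state) {
        acceptStates.add(state);
    }

    /** True if some contiguous run of tokens leads from state 0 to an accept state. */
    public boolean matches(List<String> docTokens) {
        Set<Integer> active = new HashSet<>();
        for (String token : docTokens) {
            active.add(0); // a match may begin at any document position
            Set<Integer> next = new HashSet<>();
            for (Transition t : transitions) {
                if (active.contains(t.from) && t.term.equals(token)) {
                    if (acceptStates.contains(t.to)) {
                        return true;
                    }
                    next.add(t.to);
                }
            }
            active = next;
        }
        return false;
    }

    /** "dns server", where "dns" may also be written "domain name service". */
    public static TermAutomatonSketch dnsServerExample() {
        TermAutomatonSketch a = new TermAutomatonSketch();
        a.addTransition(0, "dns", 1);     // one term spanning the whole synonym
        a.addTransition(0, "domain", 2);  // three-term path covering the same span
        a.addTransition(2, "name", 3);
        a.addTransition(3, "service", 1);
        a.addTransition(1, "server", 4);
        a.setAccept(4);
        return a;
    }
}
```

The two paths from state 0 to state 1 have different lengths, which
is what a "sausage" cannot express: there, "dns" would occupy a
single position and could not stand in for all three terms.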

It also allows "any" transitions, to match any term, so you can do
sloppy matching and span-like queries, e.g. find "lucene" and "python"
with up to 3 other terms in between.
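
A minimal sketch of how "any" transitions give sloppy matching,
again as a JDK-only toy model rather than the real API: a null label
stands for "any term", and the automaton allows up to maxGap
arbitrary terms between the two required ones.  AnyTransitionSketch
and within() are hypothetical names, and the sketch assumes the first
term must precede the second (an unordered match would need a
mirrored branch).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: hypothetical names, NOT Lucene's API. */
class AnyTransitionSketch {
    /** A null term is the "any" label and accepts every token. */
    private static final class Transition {
        final int from;
        final String term;
        final int to;

        Transition(int from, String term, int to) {
            this.from = from;
            this.term = term;
            this.to = to;
        }

        boolean accepts(String token) {
            return term == null || term.equals(token);
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private int acceptState = -1;

    /** Builds: first, then 0..maxGap arbitrary terms, then second. */
    public static AnyTransitionSketch within(String first, String second, int maxGap) {
        AnyTransitionSketch a = new AnyTransitionSketch();
        a.acceptState = maxGap + 2;
        a.transitions.add(new Transition(0, first, 1));
        for (int gap = 0; gap <= maxGap; gap++) {
            int state = 1 + gap; // state reached after skipping `gap` terms
            a.transitions.add(new Transition(state, second, a.acceptState));
            if (gap < maxGap) {
                a.transitions.add(new Transition(state, null, state + 1)); // "any"
            }
        }
        return a;
    }

    /** True if some contiguous run of tokens leads from state 0 to the accept state. */
    public boolean matches(List<String> docTokens) {
        Set<Integer> active = new HashSet<>();
        for (String token : docTokens) {
            active.add(0); // a match may begin at any document position
            Set<Integer> next = new HashSet<>();
            for (Transition t : transitions) {
                if (active.contains(t.from) && t.accepts(token)) {
                    if (t.to == acceptState) {
                        return true;
                    }
                    next.add(t.to);
                }
            }
            active = next;
        }
        return false;
    }
}
```

With within("lucene", "python", 3), a document such as "lucene has a
python port" matches (two terms in between), while one with four
terms between "lucene" and "python" does not.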

I also added a class to convert a TokenStream directly to the
automaton for this query, preserving posLength.  (Of course, the index
can't store posLength, so the matching won't be fully correct if any
indexed token has posLength != 1.)  But if you do query-time-only
synonyms then the matching should finally be correct.
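
To make the posLength point concrete, here is a small sketch of the
conversion idea only, not the class described above: each token
becomes a transition from its start position to start + posLength.
TokenGraphSketch and its Token class are hypothetical stand-ins for a
TokenStream's term, position-increment and position-length
attributes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: hypothetical names, NOT Lucene's TokenStream API. */
class TokenGraphSketch {
    /** A token with its position increment and position length. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLength;

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    /** Renders each token as "start -term-> end", using positions as states. */
    public static List<String> toTransitions(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1; // the first token's increment of 1 puts it at position 0
        for (Token t : tokens) {
            pos += t.posInc;             // posInc == 0 stacks on the previous token
            int end = pos + t.posLength; // posLength > 1 spans several positions
            out.add(pos + " -" + t.term + "-> " + end);
        }
        return out;
    }
}
```

For the stream dns (posInc=1, posLength=3), domain (0, 1), name
(1, 1), service (1, 1), server (1, 1), "dns" runs from state 0 to
state 3 and lines up with the end of "service".  If posLength were
dropped, as happens at index time, "dns" would end at state 1 and
"dns server" could no longer be matched as adjacent terms.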

I haven't tested performance, but I suspect it's quite slow: its cost
is O(sum of totalTermFreq) over all terms "used" in the automaton.
There are some optimizations we could do, e.g. detecting that some
terms in the automaton can be upgraded to MUST (right now they are all
effectively SHOULD).

I'm not sure how it should assign scores (punted on that for now), but
the matching seems to be working.




--
This message was sent by Atlassian JIRA
(v6.2#6252)
