Message-ID: <50D1DB0F.40005@ids-mannheim.de>
Date: Wed, 19 Dec 2012 16:19:43 +0100
From: Carsten Schnober <schnober@ids-mannheim.de>
Organization: Institut für Deutsche Sprache
To: java-user@lucene.apache.org
Subject: Match intersection by Payload

Hi,
I have a search scenario in which I search for multiple terms and retain
only those matches that share a common payload. I'm using this to
search for multiple terms that all occur in one sentence; I've stored a
sentence ID in the payload for each token.

So far, I've done this by specifying a list of terms and creating a
BooleanQuery that connects these terms (as in ["house", "car"]) with
Occur.MUST. That BooleanQuery is wrapped into a filter.
In the next step, I perform a separate SpanQuery for each of the terms
(one for "house" and one for "car"), using the previously created
filter's DocIdSet to restrict the search to documents that contain all
of the terms, e.g. for "house":

// RegexpQuery takes a Term; "text" is only a placeholder for the field name.
SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<>(
    new RegexpQuery(new Term("text", "house"))).rewrite(reader);

The resulting spans are stored in a map with the terms as keys and the
matching Spans as values. Finally, I retain only those matches that have
the same payload (=sentence) in the same document.
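
To make that last step concrete, here is a minimal sketch in plain Java
with no Lucene calls. The names PayloadIntersection, Match and
sentencesSharedByAll are made up for illustration, and the sketch assumes
that each span match has already been reduced to a document ID plus the
sentence ID decoded from its payload:

```java
import java.util.*;

// Hypothetical illustration: none of these names are part of Lucene.
final class PayloadIntersection {

    /** One span match, reduced to its document ID and the sentence ID from the payload. */
    static final class Match {
        final int docId;
        final int sentenceId;

        Match(int docId, int sentenceId) {
            this.docId = docId;
            this.sentenceId = sentenceId;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Match)) return false;
            Match m = (Match) o;
            return docId == m.docId && sentenceId == m.sentenceId;
        }

        @Override public int hashCode() {
            return 31 * docId + sentenceId;
        }

        @Override public String toString() {
            return "doc " + docId + ", sentence " + sentenceId;
        }
    }

    /** Returns the (document, sentence) pairs in which every term has at least one match. */
    static Set<Match> sentencesSharedByAll(Map<String, List<Match>> matchesByTerm) {
        Set<Match> shared = null;
        for (List<Match> matches : matchesByTerm.values()) {
            Set<Match> current = new HashSet<Match>(matches);
            if (shared == null) {
                shared = current;          // first term: start with all of its sentences
            } else {
                shared.retainAll(current); // later terms: keep only common sentences
            }
        }
        return shared == null ? new HashSet<Match>() : shared;
    }
}
```

With "house" matching in (doc 1, sentence 3) and (doc 2, sentence 7), and
"car" matching in (doc 1, sentence 3) and (doc 2, sentence 8), only
(doc 1, sentence 3) is retained.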

This works well for ordinary terms and is reasonably fast since the
SpanQuerys are typically restricted to a manageable document set.
However, I would prefer to use the Lucene query language rather than
specifying a static list of terms, especially because I'd like to have
features such as regular expressions, wildcards, ranges etc.
This breaks the above solution, though, because the
QueryParser can expand what is meant to be one term (e.g. "hous*") into
multiple ones ("house", "houses"). Then, the intersection as described
above no longer makes sense: I don't want sentences that
contain both "house" and "houses", but sentences that contain either one
and "car" too.
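
The behaviour I'm after can be sketched in plain Java, again without any
Lucene calls. The names GroupedPayloadMatch, key and match are
hypothetical. The sketch assumes that the expanded terms are already
grouped by the query clause they came from, which is exactly the
information I don't know how to obtain yet:

```java
import java.util.*;

// Hypothetical illustration in plain Java; none of these names are Lucene API.
final class GroupedPayloadMatch {

    /** Packs a document ID and a sentence ID (from the payload) into one key. */
    static long key(int docId, int sentenceId) {
        return ((long) docId << 32) | (sentenceId & 0xFFFFFFFFL);
    }

    /**
     * Each map is one user-level clause: expanded term -> sentence keys where it matched.
     * Terms within a clause are unioned; the clauses are then intersected.
     */
    static Set<Long> match(List<Map<String, Set<Long>>> clauses) {
        Set<Long> result = null;
        for (Map<String, Set<Long>> clause : clauses) {
            Set<Long> union = new HashSet<Long>();
            for (Set<Long> keys : clause.values()) {
                union.addAll(keys);        // "house" OR "houses"
            }
            if (result == null) {
                result = union;
            } else {
                result.retainAll(union);   // ... AND "car"
            }
        }
        return result == null ? new HashSet<Long>() : result;
    }
}
```

For the clauses "hous*" (expanded to "house" and "houses") and "car", a
sentence is kept if it contains "car" and at least one of the two
expansions; it does not need to contain both.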

I have three potential solutions in my mind:

1. Track back the terms generated by a rewritten MultiTermQuery
I could try to figure out automatically whether the terms retrieved from
the StandardQueryParser should be unioned (because they are derived from
the same term, as in "hous*") or intersected (as with "hous*" and "car").
I'm not sure how to do that reliably, though, because the single terms are
extracted only after generating a Query through a StandardQueryParser,
and at that point there is no distinction between these terms.

2. Implement my own QueryParser that distinguishes between terms
that are derived from one expression ("hous*") and those that are derived
from another ("car"). In that case, the scenario from 1. with unions and
intersections would be easy, logically at least.

3. Use a PayloadTermQuery. In that case, I'd hope to throw away the
apparently redundant query generation (one for the filter and one for
the SpanQuery) and replace it with a Query that makes matching payloads
a precondition. I'm not sure how to do that either, as I don't know
beforehand which payload value to match; it just has to be the same for
the different terms.

All these ways seem equally promising (and complicated) to me, so would
you have any advice on which one is most likely to lead to an actual
solution?

Thanks,
Carsten


-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
