From: Carsten Schnober
Organization: Institut für Deutsche Sprache
To: java-user@lucene.apache.org
Date: Wed, 19 Dec 2012 16:19:43 +0100
Subject: Match intersection by Payload
Message-ID: <50D1DB0F.40005@ids-mannheim.de>

Hi,

I have a search scenario in which I search for multiple terms and retain only those matches that share a common payload. I use this to find multiple terms that all occur in one sentence; I've stored a sentence ID in the payload of each token.

So far, I've done this by specifying a list of terms (e.g. ["house", "car"]) and creating a BooleanQuery that connects them with Occur.MUST. That BooleanQuery is wrapped into a filter. In the next step, I run a separate SpanQuery for each of the terms (one for "house" and one for "car"), using the previously created filter's DocIdSet to restrict the search to documents that contain all of the terms, e.g. for "house":

    SpanQuery sq = (SpanQuery) new SpanMultiTermQueryWrapper<>(new RegexpQuery(new Term(field, "house"))).rewrite(reader);

The resulting spans are stored in a map with the terms as keys and the matching Spans as values.
Finally, I retain only those matches that have the same payload (i.e. the same sentence) in the same document. This works well for ordinary terms and is reasonably fast, since the SpanQueries are typically restricted to a manageable document set.

However, I would prefer to use the Lucene query language rather than a static list of terms, especially because I'd like to have features such as regular expressions, wildcards, ranges etc. This makes the above solution impossible, because the QueryParser can expand what is meant to be one term (e.g. "hous*") into several ("house", "houses"). The intersection described above then no longer makes sense: I don't want sentences that contain both "house" and "houses", but sentences that contain either one and "car" too.

I have three potential solutions in mind:

1. Track back the terms generated by a rewritten MultiTermQuery. I could try to figure out automatically whether the terms retrieved from the StandardQueryParser should be combined by union (because they are derived from the same pattern, as in "hous*") or intersected (as with "hous*" and "car"). I'm not sure how to do that reliably, though, because the individual terms are extracted only after a Query has been generated by the StandardQueryParser, and at that point there is no distinction between them.

2. Implement my own QueryParser that distinguishes between terms derived from one pattern ("hous*") and those derived from another ("car"). In that case, the scenario from 1. with unions and intersections would be easy, logically at least.

3. Use a PayloadTermQuery. In that case, I'd hope to throw away the apparently redundant query generation (one for the filter and one for the SpanQuery) and substitute it with a Query that makes matching payloads a precondition. I'm not sure how to do that either, as I don't know beforehand which payload value to match; it just has to be the same for the different terms.
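
To make the union/intersection logic from options 1 and 2 concrete, here is a minimal sketch in plain Java without any Lucene types. All names (SentenceHit, SentenceIntersection, hits, intersectGroups) are hypothetical, and it assumes the (document ID, sentence ID) pairs have already been collected from the Spans and their payloads for each expanded term:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for one span match: document ID plus the sentence ID from the payload. */
final class SentenceHit {
    final int docId;
    final int sentenceId;

    SentenceHit(int docId, int sentenceId) {
        this.docId = docId;
        this.sentenceId = sentenceId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SentenceHit)) {
            return false;
        }
        SentenceHit other = (SentenceHit) o;
        return docId == other.docId && sentenceId == other.sentenceId;
    }

    @Override
    public int hashCode() {
        return 31 * docId + sentenceId;
    }

    @Override
    public String toString() {
        return "doc " + docId + ", sentence " + sentenceId;
    }
}

final class SentenceIntersection {

    /** Builds a hit set from flat (docId, sentenceId) pairs, e.g. hits(1, 3, 2, 5). */
    static Set<SentenceHit> hits(int... pairs) {
        Set<SentenceHit> set = new HashSet<SentenceHit>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            set.add(new SentenceHit(pairs[i], pairs[i + 1]));
        }
        return set;
    }

    /**
     * Each inner list holds the hit sets of all terms expanded from ONE query
     * clause; these are unioned. The unions of different clauses are intersected.
     */
    static Set<SentenceHit> intersectGroups(List<List<Set<SentenceHit>>> groups) {
        Set<SentenceHit> result = null;
        for (List<Set<SentenceHit>> group : groups) {
            Set<SentenceHit> union = new HashSet<SentenceHit>();
            for (Set<SentenceHit> termHits : group) {
                union.addAll(termHits);
            }
            if (result == null) {
                result = union;
            } else {
                result.retainAll(union);
            }
        }
        return result == null ? new HashSet<SentenceHit>() : result;
    }

    public static void main(String[] args) {
        // Clause "hous*" expanded to "house" and "houses"; clause "car" stays as it is.
        List<Set<SentenceHit>> hous = new ArrayList<Set<SentenceHit>>();
        hous.add(hits(1, 3, 2, 5)); // made-up hits for "house"
        hous.add(hits(1, 4, 3, 1)); // made-up hits for "houses"
        List<Set<SentenceHit>> car = new ArrayList<Set<SentenceHit>>();
        car.add(hits(1, 4, 2, 5, 3, 2)); // made-up hits for "car"

        List<List<Set<SentenceHit>>> groups = new ArrayList<List<Set<SentenceHit>>>();
        groups.add(hous);
        groups.add(car);
        System.out.println(intersectGroups(groups));
    }
}
```

The sketch only covers the set logic; the open question of how to learn which expanded terms belong to which clause is exactly what options 1 and 2 would have to answer.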

All these approaches seem equally promising (and complicated) to me, so would you have some advice on which one is most likely to lead to an actual solution?

Thanks,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform