Subject: Re: Search within a sentence (revisited)
From: Mark Miller
Date: Mon, 25 Jul 2011 10:29:59 -0400
To: java-user@lucene.apache.org

Thanks Peter - if you supply the unit tests, I'm happy to work on the fixes.

I can likely look at this later today.

- Mark Miller
lucidimagination.com

On Jul 25, 2011, at 10:14 AM, Peter Keegan wrote:

> Hi Mark,
>
> Sorry to bug you again, but there's another case that fails the unit test
> (search within the second sentence), as shown here in the last test:
>
> package org.apache.lucene.search.spans;
>
> import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.RandomIndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.search.spans.SpanNearQuery;
> import org.apache.lucene.search.spans.SpanQuery;
> import org.apache.lucene.search.spans.SpanTermQuery;
> import org.apache.lucene.util.LuceneTestCase;
>
> public class TestSentence extends LuceneTestCase {
>   public static final String field = "field";
>   public static final String START = "^";
>   public static final String END = "$";
>   public void testSetPosition() throws Exception {
>     Analyzer analyzer = new Analyzer() {
>       @Override
>       public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new TokenStream() {
>           private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
>           private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>           private int i = 0;
>           PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
>           CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>           OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>           @Override
>           public boolean incrementToken() {
>             assertEquals(TOKENS.length, INCREMENTS.length);
>             if (i == TOKENS.length)
>               return false;
>             clearAttributes();
>             termAtt.append(TOKENS[i]);
>             offsetAtt.setOffset(i,i);
>             posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>             i++;
>             return true;
>           }
>         };
>       }
>     };
>     Directory store = newDirectory();
>     RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
>     Document d = new Document();
>     d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>     writer.addDocument(d);
>     IndexReader reader = writer.getReader();
>     writer.close();
>     IndexSearcher searcher = newSearcher(reader);
>     SpanTermQuery startSentence = makeSpanTermQuery(START);
>     SpanTermQuery endSentence = makeSpanTermQuery(END);
>     SpanQuery[] clauses = new SpanQuery[2];
>     clauses[0] = makeSpanTermQuery("1");
>     clauses[1] = makeSpanTermQuery("2");
>     SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: "+query);
>     ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(1, hits.length);
>     clauses[1] = makeSpanTermQuery("4");
>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: "+query);
>     hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(0, hits.length);
>     PhraseQuery pq = new PhraseQuery();
>     pq.add(new Term(field, "3"));
>     pq.add(new Term(field, "4"));
>     System.out.println("query: "+pq);
>     hits = searcher.search(pq, null, 1000).scoreDocs;
>     assertEquals(1, hits.length);
>     clauses[0] = makeSpanTermQuery("4");
>     clauses[1] = makeSpanTermQuery("6");
>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: "+query);
>     hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(1, hits.length);
>   }
>
>   public SpanTermQuery makeSpanTermQuery(String text) {
>     return new SpanTermQuery(new Term(field, text));
>   }
>   public TermQuery makeTermQuery(String text) {
>     return new TermQuery(new Term(field, text));
>   }
> }
>
> Peter
>
> On Thu, Jul 21, 2011 at 5:23 PM, Mark Miller wrote:
>
>>
>> I just uploaded a patch for 3X that will work for 3.2.
>>
>> On Jul 21, 2011, at 4:25 PM, Mark Miller wrote:
>>
>>> Yeah, it's off trunk - I'll submit a 3X patch in a bit - just have to
>>> change that to an IndexReader I believe.
>>>
>>> - Mark
>>>
>>> On Jul 21, 2011, at 4:01 PM, Peter Keegan wrote:
>>>
>>>> Does this patch require the trunk version? I'm using 3.2 and
>>>> 'AtomicReaderContext' isn't there.
>>>>
>>>> Peter
>>>>
>>>> On Thu, Jul 21, 2011 at 3:07 PM, Mark Miller wrote:
>>>>
>>>>> Hey Peter,
>>>>>
>>>>> Getting sucked back into Spans...
>>>>>
>>>>> That test should pass now - I uploaded a new patch to
>>>>> https://issues.apache.org/jira/browse/LUCENE-777
>>>>>
>>>>> Further tests may be needed though.
>>>>>
>>>>> - Mark
>>>>>
>>>>>
>>>>> On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2
>>>>>> ('getTerms' removed). The last test fails (search for "1" and "3").
>>>>>>
>>>>>> package org.apache.lucene.search.spans;
>>>>>>
>>>>>> import java.io.Reader;
>>>>>>
>>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>>>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>>>> import org.apache.lucene.document.Document;
>>>>>> import org.apache.lucene.document.Field;
>>>>>> import org.apache.lucene.index.IndexReader;
>>>>>> import org.apache.lucene.index.RandomIndexWriter;
>>>>>> import org.apache.lucene.index.Term;
>>>>>> import org.apache.lucene.store.Directory;
>>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>> import org.apache.lucene.search.ScoreDoc;
>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>> import org.apache.lucene.search.spans.SpanNearQuery;
>>>>>> import org.apache.lucene.search.spans.SpanQuery;
>>>>>> import org.apache.lucene.search.spans.SpanTermQuery;
>>>>>> import org.apache.lucene.util.LuceneTestCase;
>>>>>>
>>>>>> public class TestSentence extends LuceneTestCase {
>>>>>>   public static final String field = "field";
>>>>>>   public static final String START = "^";
>>>>>>   public static final String END = "$";
>>>>>>   public void testSetPosition() throws Exception {
>>>>>>     Analyzer analyzer = new Analyzer() {
>>>>>>       @Override
>>>>>>       public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>>         return new TokenStream() {
>>>>>>           private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
>>>>>>           private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>>>>>>           private int i = 0;
>>>>>>
>>>>>>           PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
>>>>>>           CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>>>>>           OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>>>>>>
>>>>>>           @Override
>>>>>>           public boolean incrementToken() {
>>>>>>             assertEquals(TOKENS.length, INCREMENTS.length);
>>>>>>             if (i == TOKENS.length)
>>>>>>               return false;
>>>>>>             clearAttributes();
>>>>>>             termAtt.append(TOKENS[i]);
>>>>>>             offsetAtt.setOffset(i,i);
>>>>>>             posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>>>>>>             i++;
>>>>>>             return true;
>>>>>>           }
>>>>>>         };
>>>>>>       }
>>>>>>     };
>>>>>>     Directory store = newDirectory();
>>>>>>     RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
>>>>>>     Document d = new Document();
>>>>>>     d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>>>>>>     writer.addDocument(d);
>>>>>>     IndexReader reader = writer.getReader();
>>>>>>     writer.close();
>>>>>>     IndexSearcher searcher = newSearcher(reader);
>>>>>>
>>>>>>     SpanTermQuery startSentence = makeSpanTermQuery(START);
>>>>>>     SpanTermQuery endSentence = makeSpanTermQuery(END);
>>>>>>     SpanQuery[] clauses = new SpanQuery[2];
>>>>>>     clauses[0] = makeSpanTermQuery("1");
>>>>>>     clauses[1] = makeSpanTermQuery("2");
>>>>>>     SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>>>>>>     SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>>>     System.out.println("query: "+query);
>>>>>>     ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>>>>>>     assertEquals(hits.length, 1);
>>>>>>
>>>>>>     clauses[1] = makeSpanTermQuery("4");
>>>>>>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>>>>>>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>>>     System.out.println("query: "+query);
>>>>>>     hits = searcher.search(query, null, 1000).scoreDocs;
>>>>>>     assertEquals(hits.length, 0);
>>>>>>
>>>>>>     PhraseQuery pq = new PhraseQuery();
>>>>>>     pq.add(new Term(field, "3"));
>>>>>>     pq.add(new Term(field, "4"));
>>>>>>     hits = searcher.search(pq, null, 1000).scoreDocs;
>>>>>>     assertEquals(hits.length, 1);
>>>>>>
>>>>>>     clauses[1] = makeSpanTermQuery("3");
>>>>>>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>>>>>>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>>>     System.out.println("query: "+query);
>>>>>>     hits = searcher.search(query, null, 1000).scoreDocs;
>>>>>>     assertEquals(hits.length, 1);
>>>>>>
>>>>>>
>>>>>>   }
>>>>>>
>>>>>>   public SpanTermQuery makeSpanTermQuery(String text) {
>>>>>>     return new SpanTermQuery(new Term(field, text));
>>>>>>   }
>>>>>>   public TermQuery makeTermQuery(String text) {
>>>>>>     return new TermQuery(new Term(field, text));
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>> On Wed, Jul 20, 2011 at 9:22 PM, Mark Miller wrote:
>>>>>>
>>>>>>>
>>>>>>> On Jul 20, 2011, at 7:44 PM, Mark Miller wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:
>>>>>>>>
>>>>>>>>> Mark Miller's 'SpanWithinQuery' patch
>>>>>>>>> seems to have the same issue.
>>>>>>>>
>>>>>>>> If I remember right (it's been more than a couple of years), I did
>>>>>>>> index the sentence markers at the same position as the last word in
>>>>>>>> the sentence. And I think the limitation that I ate was that the word
>>>>>>>> could belong to both its true sentence and the one after it.
>>>>>>>>
>>>>>>>> - Mark Miller
>>>>>>>> lucidimagination.com
>>>>>>>
>>>>>>> Perhaps you could index the sentence marker at both the last word of the
>>>>>>> sentence as well as the first word of the next sentence if there is one.
>>>>>>> This would seem to solve the above limitation as well?
>>>>>>>
>>>>>>> - Mark Miller
>>>>>>> lucidimagination.com
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>> - Mark Miller
>>>>> lucidimagination.com
>>>
>>> - Mark Miller
>>> lucidimagination.com
>>
>> - Mark Miller
>> lucidimagination.com
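
For reference, the marker scheme discussed in this thread (the sentence-end token "$" indexed with a position increment of 0, so that it shares a position with the last word of its sentence) can be sketched in plain Java without Lucene. The names SentencePositions, positions, indexOf and sameSentence are made up for illustration and are not part of Lucene or of the LUCENE-777 patch. The check below is a simplified stand-in for the idea of "no sentence end between the two terms", not a description of how SpanWithinQuery is implemented, and it assumes the first token lands at position 0 and that each search term occurs once.

```java
// Illustrative sketch only: plain Java, no Lucene dependency.
// All names here are hypothetical and do not come from the patch.
class SentencePositions {
    static final String END = "$";

    // Turn position increments into absolute positions. An increment of 0
    // stacks a token on the same position as the token before it.
    // Assumption: the first token lands at position 0.
    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int pos = -1;
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];
            out[i] = pos;
        }
        return out;
    }

    // Index of the first occurrence of term in tokens, or -1 if absent.
    static int indexOf(String[] tokens, String term) {
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }

    // Two terms count as being in the same sentence when no END marker sits
    // at a position p with lo <= p < hi, where lo and hi are the positions
    // of the two terms. A marker stacked on the later term closes that
    // term's own sentence, so it does not separate the pair.
    static boolean sameSentence(String[] tokens, int[] increments, String a, String b) {
        int ia = indexOf(tokens, a);
        int ib = indexOf(tokens, b);
        if (ia < 0 || ib < 0) {
            return false;
        }
        int[] pos = positions(increments);
        int lo = Math.min(pos[ia], pos[ib]);
        int hi = Math.max(pos[ia], pos[ib]);
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(END) && pos[i] >= lo && pos[i] < hi) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Same token stream and increments as the unit test in this thread.
        String[] tokens = {"1", "2", "3", END, "4", "5", "6", END, "9"};
        int[] increments = {1, 1, 1, 0, 1, 1, 1, 0, 1};
        System.out.println("1,2: " + sameSentence(tokens, increments, "1", "2"));
        System.out.println("1,4: " + sameSentence(tokens, increments, "1", "4"));
        System.out.println("4,6: " + sameSentence(tokens, increments, "4", "6"));
    }
}
```

With the token stream from the test, the pairs ("1","2"), ("1","3") and ("4","6") come out as same-sentence and ("1","4") does not, which lines up with the hit counts the test asserts. The pair ("3","4") is rejected by this check even though the PhraseQuery in the test matches it, because a phrase query only looks at adjacent positions and ignores the markers.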