Subject: Re: Search within a sentence (revisited)
From: Mark Miller
Date: Tue, 26 Jul 2011 09:11:51 -0400
To: java-user@lucene.apache.org

As long as you are happy with the results, I'm good. Always nice to have
an excuse to dip back into Lucene. Just don't want you to feel
overconfident with the code without proper testing of it - I coded to fix
the broken tests rather than taking the time to write a bunch more
corner case tests like I likely should try if I was going to commit this
thing.

- Mark Miller
lucidimagination.com

On Jul 26, 2011, at 8:56 AM, Peter Keegan wrote:

> Thanks Mark! The new patch is working fine with the tests and a few more. If
> you have particular test cases in mind, I'd be happy to add them.
>
> Thanks,
> Peter
>
> On Mon, Jul 25, 2011 at 5:56 PM, Mark Miller wrote:
>
>> Sorry Peter - I introduced this problem with some kind of typo type issue -
>> I somehow changed an includeSpans variable to excludeSpans - but I certainly
>> didn't mean to - it makes no sense. So not sure how it happened, and
>> surprised the tests that passed still passed!
>>
>> We could probably use even more tests before feeling too confident here...
>>
>> I've attached a patch for 3X with the new test and fix (changed that
>> include back to exclude).
>>
>> - Mark Miller
>> lucidimagination.com
>>
>> On Jul 25, 2011, at 10:29 AM, Mark Miller wrote:
>>
>>> Thanks Peter - if you supply the unit tests, I'm happy to work on the
>>> fixes.
>>>
>>> I can likely look at this later today.
>>>
>>> - Mark Miller
>>> lucidimagination.com
>>>
>>> On Jul 25, 2011, at 10:14 AM, Peter Keegan wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> Sorry to bug you again, but there's another case that fails the unit test
>>>> (search within the second sentence), as shown here in the last test:
>>>>
>>>> package org.apache.lucene.search.spans;
>>>>
>>>> import java.io.Reader;
>>>>
>>>> import org.apache.lucene.analysis.Analyzer;
>>>> import org.apache.lucene.analysis.TokenStream;
>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>> import org.apache.lucene.document.Document;
>>>> import org.apache.lucene.document.Field;
>>>> import org.apache.lucene.index.IndexReader;
>>>> import org.apache.lucene.index.RandomIndexWriter;
>>>> import org.apache.lucene.index.Term;
>>>> import org.apache.lucene.store.Directory;
>>>> import org.apache.lucene.search.IndexSearcher;
>>>> import org.apache.lucene.search.PhraseQuery;
>>>> import org.apache.lucene.search.ScoreDoc;
>>>> import org.apache.lucene.search.TermQuery;
>>>> import org.apache.lucene.search.spans.SpanNearQuery;
>>>> import org.apache.lucene.search.spans.SpanQuery;
>>>> import org.apache.lucene.search.spans.SpanTermQuery;
>>>> import org.apache.lucene.util.LuceneTestCase;
>>>>
>>>> public class TestSentence extends LuceneTestCase {
>>>>   public static final String field = "field";
>>>>   public static final String START = "^";
>>>>   public static final String END = "$";
>>>>   public void testSetPosition() throws Exception {
>>>>     Analyzer analyzer = new Analyzer() {
>>>>       @Override
>>>>       public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>         return new TokenStream() {
>>>>           private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
>>>>           private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>>>>           private int i = 0;
>>>>           PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
>>>>           CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>>>           OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>>>>           @Override
>>>>           public boolean incrementToken() {
>>>>             assertEquals(TOKENS.length, INCREMENTS.length);
>>>>             if (i == TOKENS.length)
>>>>               return false;
>>>>             clearAttributes();
>>>>             termAtt.append(TOKENS[i]);
>>>>             offsetAtt.setOffset(i,i);
>>>>             posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>>>>             i++;
>>>>             return true;
>>>>           }
>>>>         };
>>>>       }
>>>>     };
>>>>     Directory store = newDirectory();
>>>>     RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
>>>>     Document d = new Document();
>>>>     d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>>>>     writer.addDocument(d);
>>>>     IndexReader reader = writer.getReader();
>>>>     writer.close();
>>>>     IndexSearcher searcher = newSearcher(reader);
>>>>     SpanTermQuery startSentence = makeSpanTermQuery(START);
>>>>     SpanTermQuery endSentence = makeSpanTermQuery(END);
>>>>     SpanQuery[] clauses = new SpanQuery[2];
>>>>     clauses[0] = makeSpanTermQuery("1");
>>>>     clauses[1] = makeSpanTermQuery("2");
>>>>     SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>>>>     SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>     System.out.println("query: "+query);
>>>>     ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>>>>     assertEquals(1, hits.length);
>>>>     clauses[1] = makeSpanTermQuery("4");
>>>>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>>>>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>     System.out.println("query: "+query);
>>>>     hits = searcher.search(query, null, 1000).scoreDocs;
>>>>     assertEquals(0, hits.length);
>>>>     PhraseQuery pq = new PhraseQuery();
>>>>     pq.add(new Term(field, "3"));
>>>>     pq.add(new Term(field, "4"));
>>>>     System.out.println("query: "+pq);
>>>>     hits = searcher.search(pq, null, 1000).scoreDocs;
>>>>     assertEquals(1, hits.length);
>>>>     clauses[0] = makeSpanTermQuery("4");
>>>>     clauses[1] = makeSpanTermQuery("6");
>>>>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>>>>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>     System.out.println("query: "+query);
>>>>     hits = searcher.search(query, null, 1000).scoreDocs;
>>>>     assertEquals(1, hits.length);
>>>>   }
>>>>
>>>>   public SpanTermQuery makeSpanTermQuery(String text) {
>>>>     return new SpanTermQuery(new Term(field, text));
>>>>   }
>>>>   public TermQuery makeTermQuery(String text) {
>>>>     return new TermQuery(new Term(field, text));
>>>>   }
>>>> }
>>>>
>>>> Peter
>>>>
>>>> On Thu, Jul 21, 2011 at 5:23 PM, Mark Miller wrote:
>>>>
>>>>>
>>>>> I just uploaded a patch for 3X that will work for 3.2.
>>>>>
>>>>> On Jul 21, 2011, at 4:25 PM, Mark Miller wrote:
>>>>>
>>>>>> Yeah, it's off trunk - I'll submit a 3X patch in a bit - just have to
>>>>>> change that to an IndexReader I believe.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Jul 21, 2011, at 4:01 PM, Peter Keegan wrote:
>>>>>>
>>>>>>> Does this patch require the trunk version? I'm using 3.2 and
>>>>>>> 'AtomicReaderContext' isn't there.
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> On Thu, Jul 21, 2011 at 3:07 PM, Mark Miller wrote:
>>>>>>>
>>>>>>>> Hey Peter,
>>>>>>>>
>>>>>>>> Getting sucked back into Spans...
>>>>>>>>
>>>>>>>> That test should pass now - I uploaded a new patch to
>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-777
>>>>>>>>
>>>>>>>> Further tests may be needed though.
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:
>>>>>>>>
>>>>>>>>> Hi Mark,
>>>>>>>>>
>>>>>>>>> Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2
>>>>>>>>> ('getTerms' removed). The last test fails (search for "1" and "3").
>>>>>>>>>
>>>>>>>>> package org.apache.lucene.search.spans;
>>>>>>>>>
>>>>>>>>> import java.io.Reader;
>>>>>>>>>
>>>>>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>>>>>>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>>>>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>>>>>>> import org.apache.lucene.document.Document;
>>>>>>>>> import org.apache.lucene.document.Field;
>>>>>>>>> import org.apache.lucene.index.IndexReader;
>>>>>>>>> import org.apache.lucene.index.RandomIndexWriter;
>>>>>>>>> import org.apache.lucene.index.Term;
>>>>>>>>> import org.apache.lucene.store.Directory;
>>>>>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>>>>> import org.apache.lucene.search.ScoreDoc;
>>>>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>>>>> import org.apache.lucene.search.spans.SpanNearQuery;
>>>>>>>>> import org.apache.lucene.search.spans.SpanQuery;
>>>>>>>>> import org.apache.lucene.search.spans.SpanTermQuery;
>>>>>>>>> import org.apache.lucene.util.LuceneTestCase;
>>>>>>>>>
>>>>>>>>> public class TestSentence extends LuceneTestCase {
>>>>>>>>>   public static final String field = "field";
>>>>>>>>>   public static final String START = "^";
>>>>>>>>>   public static final String END = "$";
>>>>>>>>>   public void testSetPosition() throws Exception {
>>>>>>>>>     Analyzer analyzer = new Analyzer() {
>>>>>>>>>       @Override
>>>>>>>>>       public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>>>>>         return new TokenStream() {
>>>>>>>>>           private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
>>>>>>>>>           private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>>>>>>>>>           private int i = 0;
>>>>>>>>>
>>>>>>>>>           PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
>>>>>>>>>           CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>>>>>>>>           OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>>>>>>>>>
>>>>>>>>>           @Override
>>>>>>>>>           public boolean incrementToken() {
>>>>>>>>>             assertEquals(TOKENS.length, INCREMENTS.length);
>>>>>>>>>             if (i == TOKENS.length)
>>>>>>>>>               return false;
>>>>>>>>>             clearAttributes();
>>>>>>>>>             termAtt.append(TOKENS[i]);
>>>>>>>>>             offsetAtt.setOffset(i,i);
>>>>>>>>>             posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>>>>>>>>>             i++;
>>>>>>>>>             return true;
>>>>>>>>>           }
>>>>>>>>>         };
>>>>>>>>>       }
>>>>>>>>>     };
>>>>>>>>>     Directory store = newDirectory();
>>>>>>>>>     RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
>>>>>>>>>     Document d = new Document();
>>>>>>>>>     d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>>>>>>>>>     writer.addDocument(d);
>>>>>>>>>     IndexReader reader = writer.getReader();
>>>>>>>>>     writer.close();
>>>>>>>>>     IndexSearcher searcher = newSearcher(reader);
>>>>>>>>>
>>>>>>>>>     SpanTermQuery startSentence = makeSpanTermQuery(START);
>>>>>>>>>     SpanTermQuery endSentence = makeSpanTermQuery(END);
>>>>>>>>>     SpanQuery[] clauses = new SpanQuery[2];
>>>>>>>>>     clauses[0] = makeSpanTermQuery("1");
>>>>>>>>>     clauses[1] = makeSpanTermQuery("2");
>>>>>>>>>     SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>>>>>>>>>     SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>>>>>>     System.out.println("query: "+query);
>>>>>>>>>     ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>>>>>>>>>     assertEquals(hits.length, 1);
>>>>>>>>>
>>>>>>>>>     clauses[1] = makeSpanTermQuery("4");
>>>>>>>>>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>>>>>>>>>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>>>>>>     System.out.println("query: "+query);
>>>>>>>>>     hits = searcher.search(query, null, 1000).scoreDocs;
>>>>>>>>>     assertEquals(hits.length, 0);
>>>>>>>>>
>>>>>>>>>     PhraseQuery pq = new PhraseQuery();
>>>>>>>>>     pq.add(new Term(field, "3"));
>>>>>>>>>     pq.add(new Term(field, "4"));
>>>>>>>>>     hits = searcher.search(pq, null, 1000).scoreDocs;
>>>>>>>>>     assertEquals(hits.length, 1);
>>>>>>>>>
>>>>>>>>>     clauses[1] = makeSpanTermQuery("3");
>>>>>>>>>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>>>>>>>>>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>>>>>>     System.out.println("query: "+query);
>>>>>>>>>     hits = searcher.search(query, null, 1000).scoreDocs;
>>>>>>>>>     assertEquals(hits.length, 1);
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   public SpanTermQuery makeSpanTermQuery(String text) {
>>>>>>>>>     return new SpanTermQuery(new Term(field, text));
>>>>>>>>>   }
>>>>>>>>>   public TermQuery makeTermQuery(String text) {
>>>>>>>>>     return new TermQuery(new Term(field, text));
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On Wed, Jul 20, 2011 at 9:22 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jul 20, 2011, at 7:44 PM, Mark Miller wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Mark Miller's 'SpanWithinQuery' patch
>>>>>>>>>>>> seems to have the same issue.
>>>>>>>>>>>
>>>>>>>>>>> If I remember right (it's been more than a couple years), I did index the
>>>>>>>>>>> sentence markers at the same position as the last word in the sentence. And
>>>>>>>>>>> I think the limitation that I ate was that the word could belong to both
>>>>>>>>>>> its true sentence, and the one after it.
>>>>>>>>>>>
>>>>>>>>>>> - Mark Miller
>>>>>>>>>>> lucidimagination.com
>>>>>>>>>>
>>>>>>>>>> Perhaps you could index the sentence marker at both the last word of the
>>>>>>>>>> sentence as well as the first word of the next sentence if there is one.
>>>>>>>>>> This would seem to solve the above limitation as well?
>>>>>>>>>>
>>>>>>>>>> - Mark Miller
>>>>>>>>>> lucidimagination.com
>>>>>>>>>>
>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>>>
>>>>>>>> - Mark Miller
>>>>>>>> lucidimagination.com
>>>>>>
>>>>>> - Mark Miller
>>>>>> lucidimagination.com
>>>>>
>>>>> - Mark Miller
>>>>> lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
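
For readers following the marker placement discussed in the quoted thread (sentence marker indexed at the same position as the last word of the sentence), here is a minimal plain-Java sketch of how the TOKENS and INCREMENTS arrays in the quoted test could be derived from a list of sentences. It does not use Lucene at all: the class name SentenceMarkers, the Marked holder, and the methods mark and positions are hypothetical names made up for illustration, and the position numbering assumes the first token counts as position 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch (no Lucene dependency); all names here are hypothetical.
final class SentenceMarkers {
    static final String END = "$"; // same marker string as END in the quoted test

    /** Parallel lists: each token and the position increment it would be indexed with. */
    static final class Marked {
        final List<String> tokens = new ArrayList<>();
        final List<Integer> increments = new ArrayList<>();
    }

    /**
     * Emits every word with increment 1 and an END marker with increment 0 after
     * each sentence, so the marker shares a position with the sentence's last word.
     * If lastTerminated is false, the final sentence gets no marker.
     */
    static Marked mark(List<List<String>> sentences, boolean lastTerminated) {
        Marked out = new Marked();
        for (int s = 0; s < sentences.size(); s++) {
            List<String> words = sentences.get(s);
            for (String word : words) {
                out.tokens.add(word);
                out.increments.add(1);
            }
            boolean isLast = (s == sentences.size() - 1);
            if (!words.isEmpty() && (!isLast || lastTerminated)) {
                out.tokens.add(END);
                out.increments.add(0);
            }
        }
        return out;
    }

    /** Running sum of increments, counting the first token as position 0. */
    static int[] positions(List<Integer> increments) {
        int[] pos = new int[increments.size()];
        int current = -1;
        for (int i = 0; i < pos.length; i++) {
            current += increments.get(i);
            pos[i] = current;
        }
        return pos;
    }

    public static void main(String[] args) {
        Marked m = mark(Arrays.asList(
                Arrays.asList("1", "2", "3"),
                Arrays.asList("4", "5", "6"),
                Arrays.asList("9")), false);
        System.out.println(m.tokens);      // [1, 2, 3, $, 4, 5, 6, $, 9]
        System.out.println(m.increments);  // [1, 1, 1, 0, 1, 1, 1, 0, 1]
        System.out.println(Arrays.toString(positions(m.increments))); // [0, 1, 2, 2, 3, 4, 5, 5, 6]
    }
}
```

In this layout "3" and the first "$" both land on position 2, and "6" and the second "$" both land on position 5. That shared position appears to be the overlap behind the limitation mentioned in the thread, where the last word of a sentence can look like it belongs to the following sentence as well.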