Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of markrmiller@gmail.com
 designates 209.85.212.48 as permitted sender)
Content-Type: text/plain; charset=windows-1252
Mime-Version: 1.0 (Apple Message framework v1244.3)
Subject: Re: Search within a sentence (revisited)
From: Mark Miller <markrmiller@gmail.com>
In-Reply-To: 
 <CAN8y9rTeu0kbWhWgYcCL+a41VxJMTp=UQU21tbVe+x3qBZK9rQ@mail.gmail.com>
Date: Tue, 26 Jul 2011 09:11:51 -0400
Content-Transfer-Encoding: quoted-printable
Message-Id: <AC83969B-B1BD-4B1D-96FE-4EE54C617AEF@gmail.com>
References: 
 <CAN8y9rR43XurPBCXKcofDhTzx+G0=6-dG2jna9y4jEz_onzbMw@mail.gmail.com>
 <7DD18AE8-B81B-4EFD-BD43-E6D866AF002D@gmail.com>
 <CBBA9901-60D9-492C-B932-91CDF67446D0@gmail.com>
 <CAN8y9rTs3=w9qjLVTEfiK5wptj=NjXKuU9X=vvXe-WXh+TD-GQ@mail.gmail.com>
 <99EC18FA-B784-433D-A024-014694A6FD5E@gmail.com>
 <CAN8y9rS5EmL-_kH8qXSd+rvr1uqNTsmffhSws690mVUNm6GNPw@mail.gmail.com>
 <8329E98E-70D2-4314-A135-2FD5A699B91B@gmail.com>
 <7711A405-BCB2-44D6-AE0E-C3F87C61B24C@gmail.com>
 <CAN8y9rQAptD9Zh73u-mj8bVUhtzrKZ4eBwMAy-m3yY4ZBTZNmg@mail.gmail.com>
 <64BC40C9-32A6-415E-831B-396A9E869FD0@gmail.com>
 <09F52BB3-B980-4200-BB30-0130B56CF5B7@gmail.com>
 <CAN8y9rTeu0kbWhWgYcCL+a41VxJMTp=UQU21tbVe+x3qBZK9rQ@mail.gmail.com>
To: java-user@lucene.apache.org

As long as you are happy with the results, I'm good. Always nice to have
an excuse to dip back into Lucene. Just don't want you to feel
overconfident in the code without proper testing of it - I coded to fix
the broken tests rather than taking the time to write a bunch more
corner-case tests, as I likely should if I were going to commit this
thing.
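
For anyone skimming the thread, here is a minimal plain-Java sketch (no
Lucene needed) of what the test quoted below expects. It turns the test's
INCREMENTS array into absolute positions and applies a simplified rule: two
terms share a sentence when no END marker sits at or after the earlier term
and strictly before the later one. The class name SentencePositions and its
methods are hypothetical, and the rule is an assumption made for
illustration, not the logic in the SpanWithinQuery patch.

```java
import java.util.Arrays;

final class SentencePositions {
    static final String END = "$";

    /** Absolute position of each token: a running sum of increments, starting at -1. */
    static int[] positions(int[] increments) {
        int[] pos = new int[increments.length];
        int current = -1;
        for (int i = 0; i < increments.length; i++) {
            current += increments[i];
            pos[i] = current;
        }
        return pos;
    }

    /** Position of the first occurrence of term, or -1 if it is absent. */
    static int positionOf(String term, String[] tokens, int[] pos) {
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                return pos[i];
            }
        }
        return -1;
    }

    /**
     * Illustrative rule only (an assumption, not the patch's logic): terms at
     * positions a <= b share a sentence when no END marker sits at a position
     * m with a <= m < b.
     */
    static boolean sameSentence(String t1, String t2, String[] tokens, int[] increments) {
        int[] pos = positions(increments);
        int p1 = positionOf(t1, tokens, pos);
        int p2 = positionOf(t2, tokens, pos);
        if (p1 < 0 || p2 < 0) {
            return false;
        }
        int a = Math.min(p1, p2);
        int b = Math.max(p1, p2);
        for (int i = 0; i < tokens.length; i++) {
            if (END.equals(tokens[i]) && pos[i] >= a && pos[i] < b) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Same token stream as the quoted test.
        String[] tokens = {"1", "2", "3", END, "4", "5", "6", END, "9"};
        int[] increments = {1, 1, 1, 0, 1, 1, 1, 0, 1};
        System.out.println(Arrays.toString(positions(increments)));
        System.out.println(sameSentence("1", "2", tokens, increments));
        System.out.println(sameSentence("1", "4", tokens, increments));
        System.out.println(sameSentence("4", "6", tokens, increments));
        System.out.println(sameSentence("1", "3", tokens, increments));
    }
}
```

Under that rule the pairs (1, 2), (4, 6) and (1, 3) land in the same
sentence and (1, 4) does not, which lines up with the hit counts the
SpanWithinQuery assertions in the tests ask for. It says nothing about
corner cases such as repeated terms, which is where more tests would help.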

- Mark Miller
lucidimagination.com

On Jul 26, 2011, at 8:56 AM, Peter Keegan wrote:

> Thanks Mark! The new patch is working fine with the tests and a few more.
> If you have particular test cases in mind, I'd be happy to add them.
>
> Thanks,
> Peter
>
> On Mon, Jul 25, 2011 at 5:56 PM, Mark Miller <markrmiller@gmail.com> wrote:
>
>> Sorry Peter - I introduced this problem with some kind of typo-type issue -
>> I somehow changed an includeSpans variable to excludeSpans, but I certainly
>> didn't mean to; it makes no sense. So I'm not sure how it happened, and I'm
>> surprised the tests that passed still passed!
>>
>> We could probably use even more tests before feeling too confident here...
>>
>> I've attached a patch for 3X with the new test and fix (changed that
>> include back to exclude).
>>
>> - Mark Miller
>> lucidimagination.com
>>
>> On Jul 25, 2011, at 10:29 AM, Mark Miller wrote:
>>
>>> Thanks Peter - if you supply the unit tests, I'm happy to work on the
>>> fixes.
>>>
>>> I can likely look at this later today.
>>>
>>> - Mark Miller
>>> lucidimagination.com
>>>
>>> On Jul 25, 2011, at 10:14 AM, Peter Keegan wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> Sorry to bug you again, but there's another case that fails the unit test
>>>> (search within the second sentence), as shown here in the last test:
>>>>
>>>> package org.apache.lucene.search.spans;
>>>>
>>>> import java.io.Reader;
>>>>
>>>> import org.apache.lucene.analysis.Analyzer;
>>>> import org.apache.lucene.analysis.TokenStream;
>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>> import org.apache.lucene.document.Document;
>>>> import org.apache.lucene.document.Field;
>>>> import org.apache.lucene.index.IndexReader;
>>>> import org.apache.lucene.index.RandomIndexWriter;
>>>> import org.apache.lucene.index.Term;
>>>> import org.apache.lucene.store.Directory;
>>>> import org.apache.lucene.search.IndexSearcher;
>>>> import org.apache.lucene.search.PhraseQuery;
>>>> import org.apache.lucene.search.ScoreDoc;
>>>> import org.apache.lucene.search.TermQuery;
>>>> import org.apache.lucene.search.spans.SpanNearQuery;
>>>> import org.apache.lucene.search.spans.SpanQuery;
>>>> import org.apache.lucene.search.spans.SpanTermQuery;
>>>> import org.apache.lucene.util.LuceneTestCase;
>>>>
>>>> public class TestSentence extends LuceneTestCase {
>>>>   public static final String field = "field";
>>>>   public static final String START = "^";
>>>>   public static final String END = "$";
>>>>
>>>>   public void testSetPosition() throws Exception {
>>>>     Analyzer analyzer = new Analyzer() {
>>>>       @Override
>>>>       public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>         return new TokenStream() {
>>>>           private final String[] TOKENS =
>>>>               {"1", "2", "3", END, "4", "5", "6", END, "9"};
>>>>           private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>>>>           private int i = 0;
>>>>           PositionIncrementAttribute posIncrAtt =
>>>>               addAttribute(PositionIncrementAttribute.class);
>>>>           CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>>>           OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>>>>
>>>>           @Override
>>>>           public boolean incrementToken() {
>>>>             assertEquals(TOKENS.length, INCREMENTS.length);
>>>>             if (i == TOKENS.length)
>>>>               return false;
>>>>             clearAttributes();
>>>>             termAtt.append(TOKENS[i]);
>>>>             offsetAtt.setOffset(i,i);
>>>>             posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>>>>             i++;
>>>>             return true;
>>>>           }
>>>>         };
>>>>       }
>>>>     };
>>>>     Directory store = newDirectory();
>>>>     RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
>>>>     Document d = new Document();
>>>>     d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>>>>     writer.addDocument(d);
>>>>     IndexReader reader = writer.getReader();
>>>>     writer.close();
>>>>     IndexSearcher searcher = newSearcher(reader);
>>>>     SpanTermQuery startSentence = makeSpanTermQuery(START);
>>>>     SpanTermQuery endSentence = makeSpanTermQuery(END);
>>>>     SpanQuery[] clauses = new SpanQuery[2];
>>>>     clauses[0] = makeSpanTermQuery("1");
>>>>     clauses[1] = makeSpanTermQuery("2");
>>>>     // SpanAndQuery equivalent
>>>>     SpanNearQuery allKeywords =
>>>>         new SpanNearQuery(clauses, Integer.MAX_VALUE, false);
>>>>     SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>     System.out.println("query: "+query);
>>>>     ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>>>>     assertEquals(1, hits.length);
>>>>     clauses[1] = makeSpanTermQuery("4");
>>>>     // SpanAndQuery equivalent
>>>>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false);
>>>>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>     System.out.println("query: "+query);
>>>>     hits = searcher.search(query, null, 1000).scoreDocs;
>>>>     assertEquals(0, hits.length);
>>>>     PhraseQuery pq = new PhraseQuery();
>>>>     pq.add(new Term(field, "3"));
>>>>     pq.add(new Term(field, "4"));
>>>>     System.out.println("query: "+pq);
>>>>     hits = searcher.search(pq, null, 1000).scoreDocs;
>>>>     assertEquals(1, hits.length);
>>>>     clauses[0] = makeSpanTermQuery("4");
>>>>     clauses[1] = makeSpanTermQuery("6");
>>>>     // SpanAndQuery equivalent
>>>>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false);
>>>>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>     System.out.println("query: "+query);
>>>>     hits = searcher.search(query, null, 1000).scoreDocs;
>>>>     assertEquals(1, hits.length);
>>>>   }
>>>>
>>>>   public SpanTermQuery makeSpanTermQuery(String text) {
>>>>     return new SpanTermQuery(new Term(field, text));
>>>>   }
>>>>
>>>>   public TermQuery makeTermQuery(String text) {
>>>>     return new TermQuery(new Term(field, text));
>>>>   }
>>>> }
>>>>
>>>> Peter
>>>>
>>>> On Thu, Jul 21, 2011 at 5:23 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>>
>>>>>
>>>>> I just uploaded a patch for 3X that will work for 3.2.
>>>>>
>>>>> On Jul 21, 2011, at 4:25 PM, Mark Miller wrote:
>>>>>
>>>>>> Yeah, it's off trunk - I'll submit a 3X patch in a bit - just have to
>>>>>> change that to an IndexReader I believe.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Jul 21, 2011, at 4:01 PM, Peter Keegan wrote:
>>>>>>
>>>>>>> Does this patch require the trunk version? I'm using 3.2 and
>>>>>>> 'AtomicReaderContext' isn't there.
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> On Thu, Jul 21, 2011 at 3:07 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hey Peter,
>>>>>>>>
>>>>>>>> Getting sucked back into Spans...
>>>>>>>>
>>>>>>>> That test should pass now - I uploaded a new patch to
>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-777
>>>>>>>>
>>>>>>>> Further tests may be needed though.
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:
>>>>>>>>
>>>>>>>>> Hi Mark,
>>>>>>>>>
>>>>>>>>> Here is a unit test using a version of 'SpanWithinQuery' modified for
>>>>>>>>> 3.2 ('getTerms' removed). The last test fails (search for "1" and "3").
>>>>>>>>>
>>>>>>>>> package org.apache.lucene.search.spans;
>>>>>>>>>
>>>>>>>>> import java.io.Reader;
>>>>>>>>>
>>>>>>>>> import org.apache.lucene.analysis.Analyzer;
>>>>>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>>>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>>>>>>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>>>>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>>>>>>> import org.apache.lucene.document.Document;
>>>>>>>>> import org.apache.lucene.document.Field;
>>>>>>>>> import org.apache.lucene.index.IndexReader;
>>>>>>>>> import org.apache.lucene.index.RandomIndexWriter;
>>>>>>>>> import org.apache.lucene.index.Term;
>>>>>>>>> import org.apache.lucene.store.Directory;
>>>>>>>>> import org.apache.lucene.search.IndexSearcher;
>>>>>>>>> import org.apache.lucene.search.PhraseQuery;
>>>>>>>>> import org.apache.lucene.search.ScoreDoc;
>>>>>>>>> import org.apache.lucene.search.TermQuery;
>>>>>>>>> import org.apache.lucene.search.spans.SpanNearQuery;
>>>>>>>>> import org.apache.lucene.search.spans.SpanQuery;
>>>>>>>>> import org.apache.lucene.search.spans.SpanTermQuery;
>>>>>>>>> import org.apache.lucene.util.LuceneTestCase;
>>>>>>>>>
>>>>>>>>> public class TestSentence extends LuceneTestCase {
>>>>>>>>>   public static final String field = "field";
>>>>>>>>>   public static final String START = "^";
>>>>>>>>>   public static final String END = "$";
>>>>>>>>>
>>>>>>>>>   public void testSetPosition() throws Exception {
>>>>>>>>>     Analyzer analyzer = new Analyzer() {
>>>>>>>>>       @Override
>>>>>>>>>       public TokenStream tokenStream(String fieldName, Reader reader) {
>>>>>>>>>         return new TokenStream() {
>>>>>>>>>           private final String[] TOKENS =
>>>>>>>>>               {"1", "2", "3", END, "4", "5", "6", END, "9"};
>>>>>>>>>           private final int[] INCREMENTS = {1,1,1,0,1,1,1,0,1};
>>>>>>>>>           private int i = 0;
>>>>>>>>>
>>>>>>>>>           PositionIncrementAttribute posIncrAtt =
>>>>>>>>>               addAttribute(PositionIncrementAttribute.class);
>>>>>>>>>           CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>>>>>>>>           OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>>>>>>>>>
>>>>>>>>>           @Override
>>>>>>>>>           public boolean incrementToken() {
>>>>>>>>>             assertEquals(TOKENS.length, INCREMENTS.length);
>>>>>>>>>             if (i == TOKENS.length)
>>>>>>>>>               return false;
>>>>>>>>>             clearAttributes();
>>>>>>>>>             termAtt.append(TOKENS[i]);
>>>>>>>>>             offsetAtt.setOffset(i,i);
>>>>>>>>>             posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>>>>>>>>>             i++;
>>>>>>>>>             return true;
>>>>>>>>>           }
>>>>>>>>>         };
>>>>>>>>>       }
>>>>>>>>>     };
>>>>>>>>>     Directory store = newDirectory();
>>>>>>>>>     RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
>>>>>>>>>     Document d = new Document();
>>>>>>>>>     d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>>>>>>>>>     writer.addDocument(d);
>>>>>>>>>     IndexReader reader = writer.getReader();
>>>>>>>>>     writer.close();
>>>>>>>>>     IndexSearcher searcher = newSearcher(reader);
>>>>>>>>>
>>>>>>>>>     SpanTermQuery startSentence = makeSpanTermQuery(START);
>>>>>>>>>     SpanTermQuery endSentence = makeSpanTermQuery(END);
>>>>>>>>>     SpanQuery[] clauses = new SpanQuery[2];
>>>>>>>>>     clauses[0] = makeSpanTermQuery("1");
>>>>>>>>>     clauses[1] = makeSpanTermQuery("2");
>>>>>>>>>     // SpanAndQuery equivalent
>>>>>>>>>     SpanNearQuery allKeywords =
>>>>>>>>>         new SpanNearQuery(clauses, Integer.MAX_VALUE, false);
>>>>>>>>>     SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>>>>>>     System.out.println("query: "+query);
>>>>>>>>>     ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>>>>>>>>>     assertEquals(hits.length, 1);
>>>>>>>>>
>>>>>>>>>     clauses[1] = makeSpanTermQuery("4");
>>>>>>>>>     // SpanAndQuery equivalent
>>>>>>>>>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false);
>>>>>>>>>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>>>>>>     System.out.println("query: "+query);
>>>>>>>>>     hits = searcher.search(query, null, 1000).scoreDocs;
>>>>>>>>>     assertEquals(hits.length, 0);
>>>>>>>>>
>>>>>>>>>     PhraseQuery pq = new PhraseQuery();
>>>>>>>>>     pq.add(new Term(field, "3"));
>>>>>>>>>     pq.add(new Term(field, "4"));
>>>>>>>>>     hits = searcher.search(pq, null, 1000).scoreDocs;
>>>>>>>>>     assertEquals(hits.length, 1);
>>>>>>>>>
>>>>>>>>>     clauses[1] = makeSpanTermQuery("3");
>>>>>>>>>     // SpanAndQuery equivalent
>>>>>>>>>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false);
>>>>>>>>>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>>>>>>>>>     System.out.println("query: "+query);
>>>>>>>>>     hits = searcher.search(query, null, 1000).scoreDocs;
>>>>>>>>>     assertEquals(hits.length, 1);
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   public SpanTermQuery makeSpanTermQuery(String text) {
>>>>>>>>>     return new SpanTermQuery(new Term(field, text));
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   public TermQuery makeTermQuery(String text) {
>>>>>>>>>     return new TermQuery(new Term(field, text));
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> On Wed, Jul 20, 2011 at 9:22 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jul 20, 2011, at 7:44 PM, Mark Miller wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Mark Miller's 'SpanWithinQuery' patch
>>>>>>>>>>>> seems to have the same issue.
>>>>>>>>>>>
>>>>>>>>>>> If I remember right (it's been more than a couple of years), I did
>>>>>>>>>>> index the sentence markers at the same position as the last word in
>>>>>>>>>>> the sentence. And I think the limitation that I ate was that the
>>>>>>>>>>> word could belong to both its true sentence and the one after it.
>>>>>>>>>>>
>>>>>>>>>>> - Mark Miller
>>>>>>>>>>> lucidimagination.com
>>>>>>>>>>
>>>>>>>>>> Perhaps you could index the sentence marker at both the last word of
>>>>>>>>>> the sentence as well as the first word of the next sentence if there
>>>>>>>>>> is one. This would seem to solve the above limitation as well?
>>>>>>>>>>
>>>>>>>>>> - Mark Miller
>>>>>>>>>> lucidimagination.com
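
A rough sketch of the token stream that suggestion would produce, in plain
Java with no Lucene dependency. DualMarkerSketch, Tok and withDualMarkers
are hypothetical names, and emitting only one marker when a word is both
the first and the last word of its sentence is an assumption. Whether this
layout actually removes the limitation would still need testing against
the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DualMarkerSketch {
    static final String END = "$";

    /** A term and its position increment, like one TOKENS/INCREMENTS pair in the test. */
    static final class Tok {
        final String term;
        final int increment;

        Tok(String term, int increment) {
            this.term = term;
            this.increment = increment;
        }

        @Override
        public String toString() {
            return term + "/" + increment;
        }
    }

    /**
     * Emits every word with increment 1 and adds an END marker (increment 0,
     * so it shares the word's position) on the last word of each sentence and
     * on the first word of every sentence after the first.
     */
    static List<Tok> withDualMarkers(List<List<String>> sentences) {
        List<Tok> out = new ArrayList<>();
        boolean seenSentence = false; // true once a non-empty sentence was emitted
        for (List<String> words : sentences) {
            for (int w = 0; w < words.size(); w++) {
                out.add(new Tok(words.get(w), 1));
                boolean firstOfLaterSentence = w == 0 && seenSentence;
                boolean lastOfSentence = w == words.size() - 1;
                if (firstOfLaterSentence || lastOfSentence) {
                    out.add(new Tok(END, 0)); // one marker even if both hold
                }
            }
            if (!words.isEmpty()) {
                seenSentence = true;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The three "sentences" from the test's token stream.
        List<List<String>> sentences = new ArrayList<>();
        sentences.add(Arrays.asList("1", "2", "3"));
        sentences.add(Arrays.asList("4", "5", "6"));
        sentences.add(Arrays.asList("9"));
        System.out.println(withDualMarkers(sentences));
    }
}
```

Each entry prints as term/increment, so the markers added on "4" and "9"
show up as extra $/0 entries compared with the TOKENS array in the test.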
>>>>>>>>
>>>>>>>> - Mark Miller
>>>>>>>> lucidimagination.com
>>>>>>
>>>>>> - Mark Miller
>>>>>> lucidimagination.com
>>>>>
>>>>> - Mark Miller
>>>>> lucidimagination.com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org