lucene-java-user mailing list archives

From "Angel, Eric" <ean...@business.com>
Subject RE: querying multi-value fields
Date Mon, 12 Oct 2009 23:32:32 GMT
Erick,

Thank you.  This is awesome.  I got it to work by just setting slop to 1
and returning 10 in my analyzer.getPositionIncrementGap.  Here are my
tests in case anyone else is interested:


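import java.io.Reader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class TestPositionIncrementGap extends TestCase {

	// Assumption: a whitespace-tokenizing analyzer with a gap of 10, as described
	// above. The KeywordAnalyzer originally posted here indexes each value as one
	// token, so the single-term phrase queries below could not match.
	Analyzer analyzer = new Analyzer() {
		@Override
		public TokenStream tokenStream(String fieldName, Reader reader) {
			return new WhitespaceTokenizer(reader);
		}
		@Override
		public int getPositionIncrementGap(String fieldName) {
			return 10;
		}
	};
	RAMDirectory dir = null;
	IndexWriter writer = null;
	IndexSearcher searcher = null;

	@Override
	protected void setUp() throws Exception {
		super.setUp();
		dir = new RAMDirectory();
		writer = new IndexWriter(dir, analyzer, MaxFieldLength.LIMITED);

		Document d = new Document();
		d.add(new Field("keyword", "aaa bbb", Field.Store.YES, Field.Index.ANALYZED));
		d.add(new Field("keyword", "ccc ddd eee", Field.Store.YES, Field.Index.ANALYZED));
		d.add(new Field("keyword", "fff fff", Field.Store.YES, Field.Index.ANALYZED));
		d.add(new Field("keyword", "banks sales", Field.Store.YES, Field.Index.ANALYZED));
		d.add(new Field("keyword", "sales mans", Field.Store.YES, Field.Index.ANALYZED));
		writer.addDocument(d);
		writer.close();
	}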
	@Override
	protected void tearDown() throws Exception {
		super.tearDown();
		dir = null;
		writer = null;
		searcher = null;
	}
	
	public void testMatchOnSingleExactValue() throws Exception {
		searcher = new IndexSearcher(dir, true);
		PhraseQuery pq = new PhraseQuery();
		pq.setSlop(1);
		pq.add(new Term("keyword", "aaa"));
		pq.add(new Term("keyword", "bbb"));
		
		TopDocs td = searcher.search(pq, 100);
		searcher.close();
		assertEquals(1, td.totalHits);
		
	}
	
	public void testNoMatchOnAdjacentValue() throws Exception {
		searcher = new IndexSearcher(dir, true);
		PhraseQuery pq = new PhraseQuery();
		pq.setSlop(1);
		pq.add(new Term("keyword", "bbb"));
		pq.add(new Term("keyword", "ccc"));
		
		TopDocs td = searcher.search(pq, 100);
		assertEquals(0, td.totalHits);
		searcher.close();
	}
	
	public void testNoMatchOnSingleReversedValue() throws Exception {
		searcher = new IndexSearcher(dir, true);
		PhraseQuery pq = new PhraseQuery();
		pq.setSlop(1);
		pq.add(new Term("keyword", "bbb"));
		pq.add(new Term("keyword", "aaa"));
		
		TopDocs td = searcher.search(pq, 100);
		assertEquals(0, td.totalHits);
		searcher.close();
	}
	
	public void testNoMatchOnSpannedValue() throws Exception {
		searcher = new IndexSearcher(dir, true);
		PhraseQuery pq = new PhraseQuery();
		pq.setSlop(1);
		pq.add(new Term("keyword", "aaa"));
		pq.add(new Term("keyword", "mans"));
		
		TopDocs td = searcher.search(pq, 100);
		assertEquals(0, td.totalHits);
		searcher.close();
	}
	
	public void testMatchOnSingleExactValue2() throws Exception {
		searcher = new IndexSearcher(dir, true);
		PhraseQuery pq = new PhraseQuery();
		pq.setSlop(1);
		pq.add(new Term("keyword", "fff"));
		pq.add(new Term("keyword", "fff"));
		
		TopDocs td = searcher.search(pq, 100);
		searcher.close();
		assertEquals(1, td.totalHits);
		
	}
}

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Monday, October 12, 2009 1:48 PM
To: java-user@lucene.apache.org
Subject: Re: querying multi-value fields

<<<I think Lucene sees all these values as one long
value for the field "option">>>
Not quite. Starting with the second add, a call will be made to
getPositionIncrementGap in your analyzer. If you return a number
larger than zero, then that many extra positions are inserted between
the last term of the preceding add and the first term of this add. If
you do nothing with getPositionIncrementGap, then Lucene does, indeed,
see all the terms as one long value, since the default implementation
returns 0.

Here's an illustration in which getPositionIncrementGap returns
a gap of 10, using your example adds.

Note, these numbers may be off by the infamous 1 or even 2.

term      position
value1    0
aaa       1
value2    11
bbb       12
value3    22
ccc       23
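
To make the arithmetic behind the table concrete, here is a small plain-Java sketch. It does not call Lucene; it only models an assumed rule: tokens within a value get consecutive positions, and the gap is added before every value after the first. The names PositionGapSketch, Posting and assignPositions are made up for illustration. Under this model the later values land one or two positions past the approximate figures in the table, which is within the "off by 1 or 2" caveat.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionGapSketch {

    /** One indexed token: its text and its position within the field. */
    static final class Posting {
        final String term;
        final int position;

        Posting(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Assigns positions to the whitespace-separated tokens of each value.
     * Assumed model (not Lucene's code): tokens get consecutive positions,
     * and `gap` extra positions are skipped before every value after the first.
     */
    static List<Posting> assignPositions(String[] values, int gap) {
        List<Posting> postings = new ArrayList<Posting>();
        int next = 0; // position the next token would receive
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                next += gap; // the gap only applies between values
            }
            for (String token : values[v].trim().split("\\s+")) {
                postings.add(new Posting(token, next++));
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        String[] values = {"value1 aaa", "value2 bbb", "value3 ccc"};
        for (Posting p : assignPositions(values, 10)) {
            System.out.println(p.term + " " + p.position);
        }
    }
}
```

With a gap of 0 the same method gives every token the next consecutive position, which is the "one long value" case.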


So, you really don't care about the exact slop, since you can set it to
anything less than the magic number you return from
getPositionIncrementGap. BTW, slop indicates holes, not total terms. So
with a slop of 0 all the words need to be next to each other, regardless
of whether there are two words or 20. But you still have to do the trick
with getPositionIncrementGap in order to fail to match on something like
"bbb value3", where the last term of one value is next to the first term
of the next value.
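
A rough way to see why the slop only has to stay below the gap: the plain-Java sketch below counts the holes between two terms and compares that to the slop. This is a simplified model, not Lucene's phrase matching; it handles only two terms in query order and ignores reordered matches, which Lucene can accept at a higher slop. SlopSketch and matchesInOrder are hypothetical names, and the positions assume consecutive positions inside a value plus a gap of 10 between values.

```java
public class SlopSketch {

    /**
     * In-order check for a two-term phrase: the number of empty positions
     * ("holes") between the terms must not exceed the slop.
     * Simplified model; reordered matches are not considered.
     */
    static boolean matchesInOrder(int firstPos, int secondPos, int slop) {
        int holes = secondPos - firstPos - 1;
        return holes >= 0 && holes <= slop;
    }

    public static void main(String[] args) {
        // Assumed positions for "value1 aaa" / "value2 bbb" / "value3 ccc":
        // consecutive inside a value, plus a gap of 10 between values.
        int gap = 10;
        int value2 = 2 + gap;       // 12
        int bbb = value2 + 1;       // 13
        int value3 = bbb + 1 + gap; // 24

        System.out.println(matchesInOrder(value2, bbb, 1));   // true: same value
        System.out.println(matchesInOrder(bbb, value3, 1));   // false: the gap is in the way
        System.out.println(matchesInOrder(bbb, value3, gap)); // true: slop as large as the gap
        System.out.println(matchesInOrder(3, 4, 0));          // true: no gap, values run together
    }
}
```

The third call shows the failure mode to avoid: once the slop reaches the gap, a phrase can match across two values again.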

HTH
Erick



On Mon, Oct 12, 2009 at 4:31 PM, Angel, Eric <eangel@business.com> wrote:

> I need to analyze these values since I also want the benefits of
> porterStemmer.  The problem with using PhraseQuery is that I don't
> always know the slop.  I may have values like "value4 ddd aaa".  It's a
> tricky problem because I think Lucene sees all these values as one long
> value for the field "option".
>
> -----Original Message-----
> From: Jake Mannix [mailto:jake.mannix@gmail.com]
> Sent: Monday, October 12, 2009 1:25 PM
> To: java-user@lucene.apache.org
> Subject: Re: querying multi-value fields
>
> Or else just make sure that you use PhraseQuery to hit this field when
> you want "value1 aaa".  If you don't tokenize these pairs, then you will
> have to do prefix/wildcard matching to hit just "value1" by itself (if
> this is allowed by your business logic).
>
>  -jake
>
> On Mon, Oct 12, 2009 at 1:21 PM, Adriano Crestani <adrianocrestani@gmail.com> wrote:
>
> > Hi Eric,
> >
> > To achieve what you want, do not tokenize the values you query/add to
> > this field.
> >
> > On Mon, Oct 12, 2009 at 4:05 PM, Angel, Eric <eangel@business.com> wrote:
> >
> > > I have documents that store multiple values in some fields (using the
> > > document.add(new Field()) with the same field name).  Here's what a
> > > typical document looks like:
> > >
> > >
> > >
> > > doc.option="value1 aaa"
> > >
> > > doc.option="value2 bbb"
> > >
> > > doc.option="value3 ccc"
> > >
> > >
> > >
> > > I want my queries to only match individual values, for example, a query
> > > for "value2 bbb" would return the above document, but a query for
> > > "value1 ccc" should not.  Is this at all possible in lucene at query
> > > time?  Could payloads be used for this?
> > >
> > >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


