lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Spans, appended fields, and term positions
Date Sun, 20 Nov 2005 12:10:21 GMT
I'm working on building a custom highlighter for a client, which may  
eventually be generalizable.  In my work, I've come across some  
issues I'd like to discuss.  One issue is of appended fields allowing  
querying across boundaries.  For example, if I index two fields with  
the same name:

         doc.add(new Field("repeated", "this is a repeated field -  
first instance", Field.Store.YES,
                 Field.Index.TOKENIZED));
         doc.add(new Field("repeated", "this is a repeated field -  
second instance", Field.Store.YES,
                 Field.Index.TOKENIZED));

A query, of course by design, can span across those two field  
instances.  This query, against an index with only the above single  
document in it, shows the effect:

     SpanTermQuery first = new SpanTermQuery(new Term("repeated",  
"first"));
     SpanTermQuery second = new SpanTermQuery(new Term("repeated",  
"second"));
     SpanNearQuery wrapped = new SpanNearQuery(new SpanQuery[]  
{ first, second}, 7, true);

     IndexSearcher searcher = new IndexSearcher(reader);
     Hits hits = searcher.search(wrapped);
     assertEquals(1, hits.length());

So my first question is how could this match be prevented?   
Technically if the second "this" has a large position increment then  
there would be no match.  But how could I achieve that large position  
increment?  Does it make sense to add an IndexWriter setting to  
specify a default position increment gap to use when multiple fields  
are added in this way?  And likely a setting on a per-field basis to  
specify an increment offset to use for that individual field.  There  
isn't a way an Analyzer itself could address this situation, is there?

Highlighting is quite a challenging endeavor!  Spans certainly  
provides a lot of help, but in the appended field scenario, the  
Spans.start() and .end() goes across the field boundary, so it  
requires, in my case with the text coming from stored field values,  
cleverness in how to highlight in order to keep field instances  
separate.

So, should we make some changes in allowing offsets to be controllable?

Thanks,
     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message