lucene-java-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Phrase Highlighting
Date Thu, 04 Jun 2009 00:34:26 GMT
Max Lynch wrote:
>>> Well what happens is if I use a SpanScorer instead, and allocate it like
>>> such:
>>>
>>>            analyzer = StandardAnalyzer([])
>>>            tokenStream = analyzer.tokenStream("contents",
>>>                    lucene.StringReader(text))
>>>            ctokenStream = lucene.CachingTokenFilter(tokenStream)
>>>            highlighter = lucene.Highlighter(formatter,
>>>                    lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
>>>            ctokenStream.reset()
>>>
>>>            result = highlighter.getBestFragments(ctokenStream, text,
>>>                    2, "...")
>>>
>>> My highlighter is still breaking up words inside of a span.  For example,
>>> if I search for \"John Smith\", instead of the highlighter being called for
>>> the whole "John Smith", it gets called for "John" and then "Smith".
>>>
>> I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
>> which is the default used by Highlighter) to ensure that each fragment
>> contains a full match for the query.  EG something like this (copied
>> from LIA 2nd edition):
>>
>>    TermQuery query = new TermQuery(new Term("field", "fox"));
>>
>>    TokenStream tokenStream =
>>        new SimpleAnalyzer().tokenStream("field",
>>            new StringReader(text));
>>
>>    SpanScorer scorer = new SpanScorer(query, "field",
>>                                       new CachingTokenFilter(tokenStream));
>>    Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
>>    Highlighter highlighter = new Highlighter(scorer);
>>    highlighter.setTextFragmenter(fragmenter);
>>     
>
>
>
> Okay, I hacked something up in Java that illustrates my issue.
>
>
> import org.apache.lucene.search.*;
> import org.apache.lucene.analysis.*;
> import org.apache.lucene.document.*;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.search.highlight.*;
> import org.apache.lucene.search.spans.SpanTermQuery;
> import java.io.Reader;
> import java.io.StringReader;
>
> public class PhraseTest {
>     private IndexSearcher searcher;
>     private RAMDirectory directory;
>
>     public PhraseTest() throws Exception {
>         directory = new RAMDirectory();
>
>         Analyzer analyzer = new Analyzer() {
>             public TokenStream tokenStream(String fieldName, Reader reader) {
>                 return new WhitespaceTokenizer(reader);
>             }
>
>             public int getPositionIncrementGap(String fieldName) {
>                 return 100;
>             }
>         };
>
>         IndexWriter writer = new IndexWriter(directory, analyzer, true,
>                 IndexWriter.MaxFieldLength.LIMITED);
>
>         Document doc = new Document();
>         String text = "Jimbo John is his name";
>         doc.add(new Field("contents", text, Field.Store.YES,
>                 Field.Index.ANALYZED));
>         writer.addDocument(doc);
>
>         writer.optimize();
>         writer.close();
>
>         searcher = new IndexSearcher(directory);
>
>         // Try a phrase query
>         PhraseQuery phraseQuery = new PhraseQuery();
>         phraseQuery.add(new Term("contents", "Jimbo"));
>         phraseQuery.add(new Term("contents", "John"));
>
>         // Try a SpanTermQuery
>         SpanTermQuery spanTermQuery = new SpanTermQuery(
>                 new Term("contents", "Jimbo John"));
>
>         // Try a parsed query
>         Query parsedQuery = new QueryParser("contents", analyzer)
>                 .parse("\"Jimbo John\"");
>
>         Hits hits = searcher.search(parsedQuery);
>         System.out.println("We found " + hits.length() + " hits.");
>
>         // Highlight the results
>         CachingTokenFilter tokenStream = new CachingTokenFilter(
>                 analyzer.tokenStream("contents", new StringReader(text)));
>
>         SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
>
>         SpanScorer sc = new SpanScorer(parsedQuery, "contents", tokenStream,
>                 "contents");
>
>         Highlighter highlighter = new Highlighter(formatter, sc);
>         highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
>         tokenStream.reset();
>
>         String rv = highlighter.getBestFragments(tokenStream, text, 1,
>                 "...");
>         System.out.println(rv);
>
>     }
>     public static void main(String[] args) {
>         System.out.println("Starting...");
>         try {
>             PhraseTest pt = new PhraseTest();
>         } catch(Exception ex) {
>             ex.printStackTrace();
>         }
>     }
> }
>
>
>
> The output I'm getting is that, instead of highlighting <B>Jimbo John</B>, it
> highlights <B>Jimbo</B> and then <B>John</B>.  Can I get around this somehow?
> I tried several different query types (they are declared in the code, but only
> the parsed version is being used).
>
> Thanks
> -max
>
>   
Sorry, not much you can do at the moment. The change is non-trivial for
sure (it's probably easier to write a regex that merges them). This
limitation was accepted because with most markup, it will display the
same anyway. An option to merge would be great, but while I don't
remember the details, the last time I looked it just wasn't easy to do
given the implementation. The highlighter works by running through and
scoring tokens, not phrases, and the Span highlighter asks whether a
given token is in a given span to decide if it should get a score over
0. Tokens are handed off to the SpanScorer one at a time to be scored. I
looked into adding the option at one point (back when I was putting the
SpanScorer together) and didn't find it worth the effort after getting
blocked a couple of times.
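
A rough sketch of the regex workaround mentioned above, in case it helps.
This is not part of the highlighter package: the class name HighlightMerger
and its merge method are made up for illustration, and it assumes the
highlighter output uses the <B> and </B> tags that SimpleHTMLFormatter
emits by default and that adjacent highlighted tokens are separated only by
whitespace.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene class): post-processes highlighter output.
class HighlightMerger {

    // Collapses "postTag + whitespace + preTag" into just the whitespace, so
    // that runs of adjacent highlighted tokens end up inside a single tag pair.
    static String merge(String highlighted, String preTag, String postTag) {
        Pattern gap = Pattern.compile(
                Pattern.quote(postTag) + "(\\s+)" + Pattern.quote(preTag));
        // "$1" keeps the original whitespace between the two tokens.
        return gap.matcher(highlighted).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Example input in the shape reported earlier in this thread.
        String rv = "<B>Jimbo</B> <B>John</B> is his name";
        System.out.println(merge(rv, "<B>", "</B>"));
        // prints: <B>Jimbo John</B> is his name
    }
}
```

Note that this merges any two highlighted tokens that happen to sit next to
each other, whether or not they came from the same phrase, so it only
approximates true phrase-level highlighting.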


-- 
- Mark

http://www.lucidimagination.com





