lucene-java-user mailing list archives

From Max Lynch <ihas...@gmail.com>
Subject Re: Phrase Highlighting
Date Thu, 04 Jun 2009 00:45:32 GMT
On Wed, Jun 3, 2009 at 7:34 PM, Mark Miller <markrmiller@gmail.com> wrote:

> Max Lynch wrote:
>
>>>> Well what happens is if I use a SpanScorer instead, and allocate it like
>>>> such:
>>>>
>>>>           analyzer = StandardAnalyzer([])
>>>>           tokenStream = analyzer.tokenStream("contents",
>>>>                   lucene.StringReader(text))
>>>>           ctokenStream = lucene.CachingTokenFilter(tokenStream)
>>>>           highlighter = lucene.Highlighter(formatter,
>>>>                   lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
>>>>           ctokenStream.reset()
>>>>
>>>>           result = highlighter.getBestFragments(ctokenStream, text,
>>>>                   2, "...")
>>>>
>>>> My highlighter is still breaking up words inside of a span.  For example,
>>>> if I search for \"John Smith\", instead of the highlighter being called for
>>>> the whole "John Smith", it gets called for "John" and then "Smith".
>>>>
>>> I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
>>> which is the default used by Highlighter) to ensure that each fragment
>>> contains a full match for the query.  EG something like this (copied
>>> from LIA 2nd edition):
>>>
>>>   TermQuery query = new TermQuery(new Term("field", "fox"));
>>>
>>>   TokenStream tokenStream =
>>>       new SimpleAnalyzer().tokenStream("field",
>>>           new StringReader(text));
>>>
>>>   SpanScorer scorer = new SpanScorer(query, "field",
>>>       new CachingTokenFilter(tokenStream));
>>>   Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
>>>   Highlighter highlighter = new Highlighter(scorer);
>>>   highlighter.setTextFragmenter(fragmenter);
>>>
>>>
>>
>>
>>
>> Okay, I hacked something up in Java that illustrates my issue.
>>
>>
>> import org.apache.lucene.search.*;
>> import org.apache.lucene.analysis.*;
>> import org.apache.lucene.document.*;
>> import org.apache.lucene.index.IndexWriter;
>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> import org.apache.lucene.index.Term;
>> import org.apache.lucene.queryParser.QueryParser;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.RAMDirectory;
>> import org.apache.lucene.search.highlight.*;
>> import org.apache.lucene.search.spans.SpanTermQuery;
>> import java.io.Reader;
>> import java.io.StringReader;
>>
>> public class PhraseTest {
>>    private IndexSearcher searcher;
>>    private RAMDirectory directory;
>>
>>    public PhraseTest() throws Exception {
>>        directory = new RAMDirectory();
>>
>>        Analyzer analyzer = new Analyzer() {
>>            public TokenStream tokenStream(String fieldName, Reader reader) {
>>                return new WhitespaceTokenizer(reader);
>>            }
>>
>>            public int getPositionIncrementGap(String fieldName) {
>>                return 100;
>>            }
>>        };
>>
>>        IndexWriter writer = new IndexWriter(directory, analyzer, true,
>>                IndexWriter.MaxFieldLength.LIMITED);
>>
>>        Document doc = new Document();
>>        String text = "Jimbo John is his name";
>>        doc.add(new Field("contents", text, Field.Store.YES,
>>                Field.Index.ANALYZED));
>>        writer.addDocument(doc);
>>
>>        writer.optimize();
>>        writer.close();
>>
>>        searcher = new IndexSearcher(directory);
>>
>>        // Try a phrase query
>>        PhraseQuery phraseQuery = new PhraseQuery();
>>        phraseQuery.add(new Term("contents", "Jimbo"));
>>        phraseQuery.add(new Term("contents", "John"));
>>
>>        // Try a SpanTermQuery
>>        SpanTermQuery spanTermQuery = new SpanTermQuery(
>>                new Term("contents", "Jimbo John"));
>>
>>        // Try a parsed query
>>        Query parsedQuery = new QueryParser("contents", analyzer)
>>                .parse("\"Jimbo John\"");
>>
>>        Hits hits = searcher.search(parsedQuery);
>>        System.out.println("We found " + hits.length() + " hits.");
>>
>>        // Highlight the results
>>        CachingTokenFilter tokenStream = new CachingTokenFilter(
>>                analyzer.tokenStream("contents", new StringReader(text)));
>>
>>        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
>>
>>        SpanScorer sc = new SpanScorer(parsedQuery, "contents",
>>                tokenStream, "contents");
>>
>>        Highlighter highlighter = new Highlighter(formatter, sc);
>>        highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
>>        tokenStream.reset();
>>
>>        String rv = highlighter.getBestFragments(tokenStream, text, 1,
>>                "...");
>>        System.out.println(rv);
>>
>>    }
>>    public static void main(String[] args) {
>>        System.out.println("Starting...");
>>        try {
>>            PhraseTest pt = new PhraseTest();
>>        } catch(Exception ex) {
>>            ex.printStackTrace();
>>        }
>>    }
>> }
>>
>>
>>
>> The output I'm getting is that instead of highlighting <B>Jimbo John</B>, it
>> highlights <B>Jimbo</B> and then <B>John</B>.  Can I get around this somehow?
>> I tried several different query types (they are declared in the code, but
>> only the parsed version is being used).
>>
>> Thanks
>> -max
>>
>>
>>
> Sorry, not much you can do at the moment. The change is nontrivial for
> sure (it's probably easier to write some regex that merges them). This
> limitation was accepted because with most markup, it will display the same
> anyway. An option to merge would be great, and while I don't remember the
> details, the last time I looked, it just ain't easy to do given the
> implementation. The highlighter highlights by running through and scoring
> tokens, not phrases, and the Span highlighter asks if a given token is in a
> given span to see if it should get a score over 0. Tokens are handed off one
> by one to the SpanScorer to be scored. I looked into adding the option at one
> point (back when I was putting the SpanScorer together) and didn't find it
> worth the effort after getting blocked a couple of times.
>
>
Thanks anyway, Mark.  Yeah, what I gathered from the results is that I will
only get hits and highlights for a phrase if the whole phrase was found, but
the words will be highlighted separately.  I just combine them now but was
hoping for a more elegant solution.  At least I know that what I'm
highlighting isn't random parts of the text but the actual phrase, so all is
not lost.
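
For anyone hitting the same thing, the combining step can be as small as the
one regex Mark suggests.  A minimal sketch, assuming the default
SimpleHTMLFormatter tags (<B> and </B>, as in the output above); the
HighlightMerger class and mergeAdjacent method are made-up names for
illustration, not part of Lucene:

```java
import java.util.regex.Pattern;

public class HighlightMerger {
    // Hypothetical helper, not a Lucene API. Matches a closing tag, any
    // whitespace, and the next opening tag, capturing the whitespace.
    private static final Pattern ADJACENT = Pattern.compile("</B>(\\s*)<B>");

    /** Joins highlighted tokens separated only by whitespace into one <B>...</B> run. */
    public static String mergeAdjacent(String highlighted) {
        // Drop the inner "</B>" and "<B>" but keep the whitespace between the tokens.
        return ADJACENT.matcher(highlighted).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints: <B>Jimbo John</B> is his name
        System.out.println(mergeAdjacent("<B>Jimbo</B> <B>John</B> is his name"));
    }
}
```

Note that this joins any two highlighted tokens separated only by whitespace,
so two unrelated hits that happen to sit next to each other would be merged as
well.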

-max
