lucene-java-user mailing list archives

From Max Lynch <ihas...@gmail.com>
Subject Re: Phrase Highlighting
Date Wed, 03 Jun 2009 00:57:44 GMT
> > Well what happens is if I use a SpanScorer instead, and allocate it like
> > such:
> >
> >            analyzer = StandardAnalyzer([])
> >            tokenStream = analyzer.tokenStream("contents", lucene.StringReader(text))
> >            ctokenStream = lucene.CachingTokenFilter(tokenStream)
> >            highlighter = lucene.Highlighter(formatter,
> >                lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
> >            ctokenStream.reset()
> >
> >            result = highlighter.getBestFragments(ctokenStream, text, 2, "...")
> >
> > My highlighter is still breaking up words inside of a span.  For example,
> > if I search for "John Smith", instead of the highlighter being called for
> > the whole "John Smith", it gets called for "John" and then "Smith".
>
> I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
> which is the default used by Highlighter) to ensure that each fragment
> contains a full match for the query.  EG something like this (copied
> from LIA 2nd edition):
>
>    TermQuery query = new TermQuery(new Term("field", "fox"));
>
>    TokenStream tokenStream =
>        new SimpleAnalyzer().tokenStream("field",
>            new StringReader(text));
>
>    SpanScorer scorer = new SpanScorer(query, "field",
>                                       new CachingTokenFilter(tokenStream));
>    Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
>    Highlighter highlighter = new Highlighter(scorer);
>    highlighter.setTextFragmenter(fragmenter);
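
I assume that example then pulls the fragments out with the usual
getBestFragments call (that part isn't shown in the quote, so this is just my
guess at the missing step):

    TokenStream stream =
        new SimpleAnalyzer().tokenStream("field",
            new StringReader(text));
    String result = highlighter.getBestFragments(stream, text, 3, "...");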



Okay, I hacked something up in Java that illustrates my issue.


import org.apache.lucene.search.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.spans.SpanTermQuery;
import java.io.Reader;
import java.io.StringReader;

public class PhraseTest {
    private IndexSearcher searcher;
    private RAMDirectory directory;

    public PhraseTest() throws Exception {
        directory = new RAMDirectory();

        Analyzer analyzer = new Analyzer() {
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new WhitespaceTokenizer(reader);
            }

            public int getPositionIncrementGap(String fieldName) {
                return 100;
            }
        };

        IndexWriter writer = new IndexWriter(directory, analyzer, true,
                IndexWriter.MaxFieldLength.LIMITED);

        Document doc = new Document();
        String text = "Jimbo John is his name";
        doc.add(new Field("contents", text, Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();

        searcher = new IndexSearcher(directory);

        // Try a phrase query
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(new Term("contents", "Jimbo"));
        phraseQuery.add(new Term("contents", "John"));

        // Try a SpanTermQuery
        SpanTermQuery spanTermQuery = new SpanTermQuery(
                new Term("contents", "Jimbo John"));

        // Try a parsed query
        Query parsedQuery = new QueryParser("contents", analyzer)
                .parse("\"Jimbo John\"");

        Hits hits = searcher.search(parsedQuery);
        System.out.println("We found " + hits.length() + " hits.");

        // Highlight the results
        CachingTokenFilter tokenStream = new CachingTokenFilter(
                analyzer.tokenStream("contents", new StringReader(text)));

        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();

        SpanScorer sc = new SpanScorer(parsedQuery, "contents",
                tokenStream, "contents");

        Highlighter highlighter = new Highlighter(formatter, sc);
        highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
        tokenStream.reset();

        String rv = highlighter.getBestFragments(tokenStream, text, 1, "...");
        System.out.println(rv);

    }
    public static void main(String[] args) {
        System.out.println("Starting...");
        try {
            PhraseTest pt = new PhraseTest();
        } catch(Exception ex) {
            ex.printStackTrace();
        }
    }
}



Instead of highlighting <B>Jimbo John</B> as a whole phrase, the output I'm
getting highlights <B>Jimbo</B> and then <B>John</B> separately.  Can I get
around this somehow?  I tried several different query types (they are all
declared in the code above, but only the parsed version is actually used).
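
For what it's worth, an untested way to watch those individual calls would be
to wrap SimpleHTMLFormatter in a small debugging Formatter, something like the
sketch below (highlightTerm seems to be invoked once per matching token group
rather than once per phrase, which matches what I'm seeing):

    Formatter debugFormatter = new Formatter() {
        private final SimpleHTMLFormatter delegate = new SimpleHTMLFormatter();

        public String highlightTerm(String originalText, TokenGroup group) {
            if (group.getTotalScore() > 0) {
                // print every piece of text the Highlighter asks us to mark up
                System.out.println("highlighting: " + originalText);
            }
            return delegate.highlightTerm(originalText, group);
        }
    };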

Thanks
-max
