lucene-java-user mailing list archives

From "Michael Stoppelman" <stop...@gmail.com>
Subject Re: Highlighter that works with phrase and span queries
Date Mon, 27 Aug 2007 16:06:18 GMT
Ah, much clearer now. It seems that the jar file is just the class files. Is
the source/javadoc code somewhere else?

-M

On 8/27/07, Mark Miller <markrmiller@gmail.com> wrote:
>
> I am a bit unclear about your question. The patch you mention extends
> the original Highlighter to support phrase and span queries. It does not
> include any major performance increases over the original Highlighter
> (in fact, it takes a bit longer to Highlight a Span or Phrase query than
> it does to just highlight Terms).
>
> Will it be released with the next version of Lucene? Doesn't look like
> it, but anything is possible. A few people are using it, but there has
> not been widespread interest that I have seen. My guess is that there
> are just not enough people trying to highlight Span queries -- which I'd
> blame on a lack of Span support in the default Lucene Query syntax.
>
> Whether it is included soon or not, the code works well and I will
> continue to support it.
>
> - Mark
>
> Michael Stoppelman wrote:
> > Is this jar going to be in the next release of lucene? Also, are these
> > the same as the changes in the following patch:
> >
> > https://issues.apache.org/jira/secure/attachment/12362653/spanhighlighter10.patch
> >
> > -M
> >
> > On 6/27/07, Mark Miller <markrmiller@gmail.com> wrote:
> >
> >>
> >>> I have not looked at any highlighting code yet. Is there already an
> >>> extension of PhraseQuery that has getSpans() ?
> >>>
> >> Currently I am using this code originally by M. Harwood:
> >>             Term[] phraseQueryTerms = ((PhraseQuery) query).getTerms();
> >>             int i;
> >>             SpanQuery[] clauses = new SpanQuery[phraseQueryTerms.length];
> >>
> >>             for (i = 0; i < phraseQueryTerms.length; i++) {
> >>                 clauses[i] = new SpanTermQuery(phraseQueryTerms[i]);
> >>             }
> >>
> >>             SpanNearQuery sp = new SpanNearQuery(clauses,
> >>                     ((PhraseQuery) query).getSlop(), false);
> >>             sp.setBoost(query.getBoost());
> >>
> >> I don't think it is perfect logic for PhraseQuery's edit distance, but
> >> it approximates extremely well in most cases.
> >>
> >> I wonder if this approach to Highlighting would be worth it in the end.
> >> Certainly, it would seem to require that you store offsets or you would
> >> have to re-tokenize anyway.
> >>
> >> Some more interesting "stuff" on the current Highlighter methods:
> >>
> >> We can gain a lot of speed on the implementation of the current
> >> Highlighter if we grab from the source text in bigger chunks. Ronnie's
> >> Highlighter appears to be faster than the original due to two things:
> >> he doesn't have to re-tokenize text and he rebuilds the original
> >> document in large pieces. Depending on how you want to look at it, he
> >> loses most of the speed gained from looking only at the Query tokens
> >> (instead of all tokens) to pulling the Term offset information, which
> >> appears pretty slow.
> >>
> >> If you use a SimpleAnalyzer on docs around 1800 tokens long, you can
> >> actually match the speed of Ronnie's highlighter with the current
> >> highlighter if you just rebuild the highlighted documents in bigger
> >> pieces, i.e. instead of going through each token and adding the source
> >> text that it covers, build up the offset information until you get
> >> another hit and then pull from the source text into the highlighted
> >> text in one big piece rather than a token's worth at a time. Of course,
> >> this is not compatible with the way the Fragmenter currently works. If
> >> you use the StandardAnalyzer instead of SimpleAnalyzer, Ronnie's
> >> highlighter wins because it takes so darn long to re-analyze.
> >>
> >> It is also interesting to note that it is very difficult to see a
> >> gain in using TokenSources to build a TokenStream. Using the
> >> StandardAnalyzer, it takes docs that are 1800 tokens long just to be as
> >> fast as re-analyzing. Notice I didn't say fast, but "as fast". With
> >> anything smaller, or with a simpler analyzer, TokenSources is
> >> certainly not worth it. It just takes too long to pull TermVector info.
> >>
> >> - Mark
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >>
> >
> >
>
>
>
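
The "rebuild in bigger pieces" idea from the quoted 6/27 message can be
sketched roughly as follows. This is only an illustration under stated
assumptions, not the Highlighter's actual code: it uses no Lucene classes, the
Tok class and the tokenize() helper are hypothetical stand-ins for a
TokenStream with offsets, the <B> tags are an arbitrary choice, and
fragmenting is ignored entirely (which, as noted in the message, is exactly
where the approach conflicts with the current Fragmenter).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ChunkedHighlightSketch {

    /** Hypothetical stand-in for a token carrying character offsets into the source text. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Toy tokenizer (an assumption for this sketch): lowercased runs of letters. */
    static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetter(text.charAt(i))) {
                int start = i;
                while (i < text.length() && Character.isLetter(text.charAt(i))) {
                    i++;
                }
                toks.add(new Tok(text.substring(start, i).toLowerCase(), start, i));
            } else {
                i++;
            }
        }
        return toks;
    }

    /** Token at a time: every token triggers its own small copies from the source text. */
    static String perToken(String text, List<Tok> toks, Set<String> hits) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Tok t : toks) {
            out.append(text, last, t.start); // text between the previous token and this one
            if (hits.contains(t.term)) {
                out.append("<B>").append(text, t.start, t.end).append("</B>");
            } else {
                out.append(text, t.start, t.end);
            }
            last = t.end;
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    /** Chunked: only track offsets until the next hit, then copy everything up to it at once. */
    static String chunked(String text, List<Tok> toks, Set<String> hits) {
        StringBuilder out = new StringBuilder();
        int copied = 0; // source offset up to which text has already been emitted
        for (Tok t : toks) {
            if (!hits.contains(t.term)) {
                continue; // no copy for non-hits; the pending chunk just grows
            }
            out.append(text, copied, t.start); // one big piece since the last hit
            out.append("<B>").append(text, t.start, t.end).append("</B>");
            copied = t.end;
        }
        out.append(text, copied, text.length()); // tail after the final hit
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        Set<String> hits = new HashSet<>(Arrays.asList("fox", "lazy"));
        List<Tok> toks = tokenize(text);
        System.out.println(perToken(text, toks, hits));
        System.out.println(chunked(text, toks, hits));
    }
}
```

Both methods produce the same highlighted string; the chunked variant simply
performs one copy per hit instead of one or two per token. Whether that
difference is measurable in a real index depends on the analyzer and document
size, as the timings discussed in the message suggest.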
