lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Max Lynch <ihas...@gmail.com>
Subject Re: Phrase Highlighting
Date Thu, 21 May 2009 19:09:11 GMT
On Thu, Apr 30, 2009 at 5:16 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Thu, Apr 30, 2009 at 12:15 AM, Max Lynch <ihasmax@gmail.com> wrote:
> > You should switch to the SpanScorer (in o.a.l.search.highlighter).
> >> That fragment scorer should only match true phrase matches.
> >>
> >> Mike
> >>
> >
> > Thanks Mike.  I gave it a try and it wasn't working how I expected.  I am
> > using pylucene right now so I can ask them if the implementation is
> > different.  I'm messing around with the lucene unit tests to see exactly
> how
> > the scorer should work.
>
> Can you give more details on what's not working right?


Sorry, the following code is in python, but I can hack a Java thing together
if necessary. HighlighterSpanScorer is the SpanScorer from the highlight
package just renamed to avoid conflict with the other SpanScorer object.

Well what happens is if I use a SpanScorer instead, and allocate it like
such:

            analyzer = StandardAnalyzer([])
            tokenStream = analyzer.tokenStream("contents",
lucene.StringReader(text))
            ctokenStream = lucene.CachingTokenFilter(tokenStream)
            highlighter = lucene.Highlighter(formatter,
lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
            ctokenStream.reset()

            result = highlighter.getBestFragments(ctokenStream, text,
                    2, "...")

 My highlighter is still breaking up words inside of a span.  For example,
if I search for \"John Smith\", instead of the highlighter being called for
the whole "John Smith", it gets called for "John" and then "Smith".



>
> > In the mean time, If I am interested in finding out exactly how many
> times a
> > term was found in a document, what is the best way to go about this?  The
> > way I am doing it right now is using a highlighter and just incrementing
> > counters when a word is found that I'm interested.  I just came across
> > FieldSortedTermVectorMapper that could do something similar.  Is
> > FieldSortedTermVectorMapper something I could use for this?  Is there a
> > better option?
>
>
> Is it really just single terms you need to measure?  (eg, not "how
> many times did phrase XYZ occur in the doc").  If so, then getting the
> term vectors and locating your term in there, should work.  This is
> probably OK if you just do it for each of the hits on the page (like
> 10 hits), but will be way too slow if you try to do it for say all
> docs that matched the query.
>

I see how the term vector might be used.  I can't really tell if there is a
way for me to do a Span check on the words as easily as the highlighter
would do.


-max

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message