lucene-solr-user mailing list archives

From Justin Lee <lee.justi...@gmail.com>
Subject Re: Getting a list of matching terms and offsets
Date Sun, 05 Jun 2016 19:37:04 GMT
Thanks, yeah, I looked at debug query too.  Unfortunately the output of
debug query doesn't quite do it.  For example, if you use a wildcard query,
it will simply explain the score associated with that wildcard query, not
the actual matching token.  In other words, if you search for "hour*" and
the actual matching text is "hours", debug query doesn't tell you that.
Instead, it just reports the score associated with "hour*".
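
To make the desired output concrete, here is a minimal sketch in plain
Python.  It is not Solr or Lucene code: the function name matching_tokens,
the simple word tokenizer, and the sample text are all assumptions made up
for illustration.  It only shows the shape of the result being asked for,
namely each matching token together with its character offsets.

```python
import fnmatch
import re


def matching_tokens(text, pattern):
    """Return (token, start, end) for every token that matches a wildcard pattern.

    Hypothetical helper for illustration only: it splits on word characters
    and lowercases tokens before matching, which is far simpler than a real
    Solr analysis chain (no stemming, no phrase handling).
    """
    matches = []
    for m in re.finditer(r"\w+", text):
        token = m.group().lower()
        # fnmatchcase applies shell-style wildcards, so "hour*" matches "hours".
        if fnmatch.fnmatchcase(token, pattern):
            matches.append((m.group(), m.start(), m.end()))
    return matches


text = "Opening hours vary; call an hour ahead."
print(matching_tokens(text, "hour*"))
# prints [('hours', 8, 13), ('hour', 28, 32)]
```

The offsets are what would let the matches be mapped back to the physical
coordinates stored out of band; debug query gives only the score for
"hour*", not this list.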

The closest example I've ever found is this:

https://lucidworks.com/blog/2013/05/09/update-accessing-words-around-a-positional-match-in-lucene-4/

But this kind of approach won't let me use the full power of the Solr
ecosystem.  I'd basically be back to dealing with Lucene directly, which I
think is a step backwards.  I think the right approach is to write my own
SearchComponent, using the highlighter as a starting point.  But I wanted
to make sure there wasn't a simpler way.

On Sun, Jun 5, 2016 at 11:30 AM Ahmet Arslan <iorixxx@yahoo.com.invalid>
wrote:

> Well, debug query has the list of tokens that caused the match.
> If I am not mistaken, I read an example about span queries and spans.
> It was listing the positions of the matches.
> I cannot find the example at the moment.
>
> Ahmet
>
>
>
> On Sunday, June 5, 2016 9:10 PM, Justin Lee <lee.justin.m@gmail.com>
> wrote:
> Thanks for the responses Alex and Ahmet.
>
> The TermVector component was the first thing I looked at, but what it gives
> you is offset information for every token in the document.  I'm trying to
> get a list of tokens that actually match the search query, and unless I'm
> missing something, the TermVector component doesn't give you that
> information.
>
> The TermSpans class does contain the right information, but again the hard
> part is: how do I reliably get a list of TermSpans for the tokens that
> actually match the search query?  That's why I ended up in the highlighter
> source code, because the highlighter has to do just this in order to create
> snippets with accurate highlighting.
>
> Justin
>
>
> On Sun, Jun 5, 2016 at 9:09 AM Ahmet Arslan <iorixxx@yahoo.com.invalid>
> wrote:
>
> > Hi,
> >
> > Maybe org.apache.lucene.search.spans.TermSpans?
> >
> >
> >
> > On Sunday, June 5, 2016 7:59 AM, Alexandre Rafalovitch <arafalov@gmail.com>
> > wrote:
> > It sounds like TermVector component's output:
> >
> > https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
> >
> > Perhaps with additional flags enabled (e.g. tv.offsets and/or
> > tv.positions).
> >
> > Regards,
> >    Alex.
> > ----
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> >
> > On 5 June 2016 at 07:39, Justin Lee <lee.justin.m@gmail.com> wrote:
> > > Is anyone aware of a way of getting a list of each matching token and
> > > their offsets after executing a search?  The reason I want to do this is
> > > because I have the physical coordinates of each token in the original
> > > document stored out of band, and I want to be able to highlight in the
> > > original document.  I would really like to have Solr return the list of
> > > matching tokens because then things like stemming and phrase matching
> > > will work as expected.  I'm thinking of something like the highlighter
> > > component, except instead of returning html, it would return just the
> > > matching tokens and their offsets.
> > >
> > > I have googled high and low and can't seem to find an exact answer to
> > > this question, so I have spent the last few days examining the internals
> > > of the various highlighting classes in Solr and Lucene.  I think the bulk
> > > of the action is in WeightedSpanTermExtractor and its interaction with
> > > getBestTextFragments in the Highlighter class.  But before I spend any
> > > more time on this I thought I'd ask (1) whether anyone knows of an easier
> > > way of doing this, and (2) whether I'm at least barking up the right tree.
> > >
> > > Thanks much,
> > > Justin
> >
>
