Date: Sun, 5 Jun 2016 18:30:40 +0000 (UTC)
From: Ahmet Arslan <iorixxx@yahoo.com.INVALID>
Reply-To: Ahmet Arslan <iorixxx@yahoo.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Subject: Re: Getting a list of matching terms and offsets
archived-at: Sun, 05 Jun 2016 18:30:54 -0000

Well, the debug query output lists the tokens that caused the match.
If I am not mistaken, I once read an example that used span queries and Spans.
It listed the positions of the matches.
I cannot find the example at the moment.

Ahmet


On Sunday, June 5, 2016 9:10 PM, Justin Lee <lee.justin.m@gmail.com> wrote:
Thanks for the responses Alex and Ahmet.

The TermVector component was the first thing I looked at, but what it gives
you is offset information for every token in the document.  I'm trying to
get a list of tokens that actually match the search query, and unless I'm
missing something, the TermVector component doesn't give you that
information.
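
To make the gap concrete, here is a rough sketch of the naive workaround:
filter the TermVector output by the analyzed query terms. It assumes the
response has already been parsed into a map of term to {startOffset,
endOffset} pairs; the class and method names are hypothetical, not Solr or
Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; nothing here is a Solr or Lucene API.
class TermVectorFilter {

    /**
     * Keeps only the term-vector entries whose term is one of the analyzed
     * query terms. Each int[] is {startOffset, endOffset}.
     */
    static Map<String, List<int[]>> matching(
            Map<String, List<int[]>> termOffsets, Set<String> analyzedQueryTerms) {
        Map<String, List<int[]>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<int[]>> e : termOffsets.entrySet()) {
            if (analyzedQueryTerms.contains(e.getKey())) {
                kept.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Offsets for "the quick brown fox and the brown dog" (assumed data).
        Map<String, List<int[]>> tv = new LinkedHashMap<>();
        tv.put("quick", Collections.singletonList(new int[] {4, 9}));
        tv.put("brown", Arrays.asList(new int[] {10, 15}, new int[] {28, 33}));
        tv.put("fox", Collections.singletonList(new int[] {16, 19}));
        tv.put("dog", Collections.singletonList(new int[] {34, 37}));

        // Terms of the phrase query "brown fox" after analysis (assumed).
        Set<String> query = new HashSet<>(Arrays.asList("brown", "fox"));

        for (Map.Entry<String, List<int[]>> e : matching(tv, query).entrySet()) {
            for (int[] span : e.getValue()) {
                System.out.println(e.getKey() + " " + span[0] + "-" + span[1]);
            }
        }
    }
}
```

With a phrase query like "brown fox", this also returns the second "brown",
which is not part of any phrase match. Term-level filtering cannot tell the
difference, which is why the TermVector output alone is not enough.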

The TermSpans class does contain the right information, but again the hard
part is: how do I reliably get a list of TermSpans for the tokens that
actually match the search query?  That's why I ended up in the highlighter
source code, because the highlighter has to do just this in order to create
snippets with accurate highlighting.
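
For context, the last step on my side is simple once I have the matching
offsets from somewhere. A minimal sketch, assuming each token's character
offsets are stored next to its out-of-band coordinates; all types and names
below are hypothetical and none of them come from Solr or Lucene:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: maps matched character ranges to stored token boxes.
class OffsetToBoxes {

    /** A token's character offsets plus its out-of-band page coordinates. */
    static final class TokenBox {
        final int startOffset; // inclusive
        final int endOffset;   // exclusive
        final double x, y, width, height;

        TokenBox(int startOffset, int endOffset,
                 double x, double y, double width, double height) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** A matched character range [start, end). */
    static final class Match {
        final int start;
        final int end;

        Match(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Returns, in document order, every token box that overlaps a match. */
    static List<TokenBox> boxesFor(List<TokenBox> tokens, List<Match> matches) {
        List<TokenBox> out = new ArrayList<>();
        for (TokenBox t : tokens) {
            for (Match m : matches) {
                // Half-open ranges overlap when each starts before the other ends.
                if (t.startOffset < m.end && m.start < t.endOffset) {
                    out.add(t);
                    break; // add each token at most once
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens of "quick brown fox" with made-up page coordinates.
        List<TokenBox> tokens = Arrays.asList(
                new TokenBox(0, 5, 10.0, 20.0, 30.0, 12.0),
                new TokenBox(6, 11, 45.0, 20.0, 32.0, 12.0),
                new TokenBox(12, 15, 82.0, 20.0, 18.0, 12.0));

        // One phrase match covering "brown fox".
        List<Match> matches = Collections.singletonList(new Match(6, 15));

        for (TokenBox b : boxesFor(tokens, matches)) {
            System.out.println(b.startOffset + "-" + b.endOffset
                    + " at (" + b.x + ", " + b.y + ")");
        }
    }
}
```

So the only missing piece is a reliable source for the Match list, one that
already accounts for stemming and phrase matching.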

Justin


On Sun, Jun 5, 2016 at 9:09 AM Ahmet Arslan <iorixxx@yahoo.com.invalid>
wrote:

> Hi,
>
> Maybe org.apache.lucene.search.spans.TermSpans?
>
>
>
> On Sunday, June 5, 2016 7:59 AM, Alexandre Rafalovitch <arafalov@gmail.com>
> wrote:
> It sounds like TermVector component's output:
> https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
>
> Perhaps with additional flags enabled (e.g. tv.offsets and/or
> tv.positions).
>
> Regards,
>    Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
>
> On 5 June 2016 at 07:39, Justin Lee <lee.justin.m@gmail.com> wrote:
> > Is anyone aware of a way of getting a list of the matching tokens and
> > their offsets after executing a search?  The reason I want to do this is
> > because I have the physical coordinates of each token in the original
> > document stored out of band, and I want to be able to highlight in the
> > original document.  I would really like to have Solr return the list of
> > matching tokens because then things like stemming and phrase matching
> > will work as expected.  I'm thinking of something like the highlighter
> > component, except instead of returning HTML, it would return just the
> > matching tokens and their offsets.
> >
> > I have googled high and low and can't seem to find an exact answer to
> > this question, so I have spent the last few days examining the internals
> > of the various highlighting classes in Solr and Lucene.  I think the bulk
> > of the action is in WeightedSpanTermExtractor and its interaction with
> > getBestTextFragments in the Highlighter class.  But before I spend any
> > more time on this I thought I'd ask (1) whether anyone knows of an easier
> > way of doing this, and (2) whether I'm at least barking up the right tree.
> >
> > Thanks much,
> > Justin
>