Date: Sun, 5 Jun 2016 18:30:40 +0000 (UTC)
From: Ahmet Arslan
To: "solr-user@lucene.apache.org"
Message-ID: <639976465.1461258.1465151440344.JavaMail.yahoo@mail.yahoo.com>
References: <1288152819.1448269.1465142941306.JavaMail.yahoo@mail.yahoo.com>
Subject: Re: Getting a list of matching terms and offsets

Well, debug query has the list of tokens that caused the match. If I am not mistaken, I read an example about span queries and Spans. It listed the positions of the matches.
I cannot find the example at the moment.

Ahmet

On Sunday, June 5, 2016 9:10 PM, Justin Lee wrote:

Thanks for the responses Alex and Ahmet. The TermVector component was the first thing I looked at, but what it gives you is offset information for every token in the document. I'm trying to get a list of tokens that actually match the search query, and unless I'm missing something, the TermVector component doesn't give you that information. The TermSpans class does contain the right information, but again the hard part is: how do I reliably get a list of TermSpans for the tokens that actually match the search query? That's why I ended up in the highlighter source code, because the highlighter has to do just this in order to create snippets with accurate highlighting.

Justin

On Sun, Jun 5, 2016 at 9:09 AM Ahmet Arslan wrote:

> Hi,
>
> May be org.apache.lucene.search.spans.TermSpans ?
>
> On Sunday, June 5, 2016 7:59 AM, Alexandre Rafalovitch wrote:
> It sounds like TermVector component's output:
> https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
>
> Perhaps with additional flags enabled (e.g. tv.offsets and/or
> tv.positions).
>
> Regards,
>    Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
> On 5 June 2016 at 07:39, Justin Lee wrote:
> > Is anyone aware of a way of getting a list of each matching token and their
> > offsets after executing a search? The reason I want to do this is because
> > I have the physical coordinates of each token in the original document
> > stored out of band, and I want to be able to highlight in the original
> > document. I would really like to have Solr return the list of matching
> > tokens because then things like stemming and phrase matching will work as
> > expected. I'm thinking of something like the highlighter component, except
> > instead of returning html, it would return just the matching tokens and
> > their offsets.
> >
> > I have googled high and low and can't seem to find an exact answer to this
> > question, so I have spent the last few days examining the internals of the
> > various highlighting classes in Solr and Lucene. I think the bulk of the
> > action is in WeightedSpanTermExtractor and its interaction with
> > getBestTextFragments in the Highlighter class. But before I spend any more
> > time on this I thought I'd ask (1) whether anyone knows of an easier way of
> > doing this, and (2) whether I'm at least barking up the right tree.
> >
> > Thanks much,
> > Justin
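
For reference, the Term Vector Component request that Alex suggests in the quoted thread could be built roughly as sketched below. This is only a minimal sketch under stated assumptions: the host, the collection name "mycollection", the field name "body", and the /tvrh handler path are placeholders that depend on the local solrconfig.xml, and the helper name build_term_vector_url is made up for illustration. The tv, tv.fl, tv.offsets and tv.positions parameters are the component's own flags.

```python
from urllib.parse import urlencode


def build_term_vector_url(base_url, collection, query, field):
    """Build a Term Vector Component request URL (hypothetical helper).

    Assumes a request handler at /tvrh that has the Term Vector
    Component enabled; the actual path depends on solrconfig.xml.
    """
    params = [
        ("q", query),
        ("fl", "id"),
        ("tv", "true"),            # switch the component on
        ("tv.fl", field),          # field to return term vectors for
        ("tv.offsets", "true"),    # start/end character offsets per term
        ("tv.positions", "true"),  # token positions per term
        ("wt", "json"),
    ]
    return f"{base_url}/{collection}/tvrh?{urlencode(params)}"


# Placeholder host, collection, and field names (assumptions).
url = build_term_vector_url(
    "http://localhost:8983/solr", "mycollection", "body:highlight", "body"
)
print(url)
```

As Justin points out in the thread, a request like this returns offsets for every token in the field, not only for the tokens that matched the query, so it does not answer the original question by itself. The field would also need term vectors with offsets and positions stored in the schema for the flags to return anything.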