From: Michael McCandless
Date: Sat, 15 Feb 2014 05:53:39 -0500
Subject: Re: Only highlight terms that caused a search hit/match
To: Lucene Users
Cc: sdavids@gmail.com
In-Reply-To: <634749EA-A522-4824-B3C5-521693E38491@gmail.com>

Unfortunately, all of Lucene's highlighters are "approximate" in this regard: there is no guarantee that the shown snippets, if they were a single little document, would have matched the query.

Even the newest highlighter, PostingsHighlighter, doesn't look at positions, so e.g. a PhraseQuery highlight could be "wrong", though "typically" the snippets with all terms from the phrase will score higher and be more likely to be picked in practice.

Net/net I think a "precise highlighter" would be a nice addition to Lucene, but it is a challenge because you need to turn every leaf query into a positional query, even queries like TermQuery that normally don't touch positions. Then you need to follow the query tree while you highlight, so that in your first example, a OR (b AND z), having picked a snippet or two for a, you then also go and pick a snippet or two for the b AND z clause, and then present them both together.

It's a hard problem but it would make a great addition.
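To make the positional point concrete, here is a minimal Java sketch using the phrase "x y z" and the text from the second example in the question. It is a toy model and not Lucene code: the class and method names (PhrasePositionSketch, termOnlyPositions, phrasePositions) are hypothetical, tokens are plain strings, and positions are simply list indexes. It only contrasts marking every term occurrence with marking the occurrences that sit inside an actual phrase match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model only: tokens are plain strings, positions are list indexes.
// None of these names come from Lucene.
final class PhrasePositionSketch {

    // Approximate highlighting: mark every token that equals any phrase term,
    // regardless of where it occurs.
    static List<Integer> termOnlyPositions(List<String> tokens, List<String> phrase) {
        List<Integer> marked = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (phrase.contains(tokens.get(i))) {
                marked.add(i);
            }
        }
        return marked;
    }

    // Position-aware highlighting: mark only tokens that lie inside a
    // contiguous, in-order occurrence of the whole phrase.
    static List<Integer> phrasePositions(List<String> tokens, List<String> phrase) {
        TreeSet<Integer> marked = new TreeSet<>();
        int n = phrase.size();
        if (n == 0) {
            return new ArrayList<>();
        }
        for (int start = 0; start + n <= tokens.size(); start++) {
            if (tokens.subList(start, start + n).equals(phrase)) {
                for (int i = start; i < start + n; i++) {
                    marked.add(i);
                }
            }
        }
        return new ArrayList<>(marked);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("y", "z", "x", "y", "z", "a");
        List<String> phrase = Arrays.asList("x", "y", "z");
        System.out.println(termOnlyPositions(tokens, phrase)); // prints [0, 1, 2, 3, 4]
        System.out.println(phrasePositions(tokens, phrase));   // prints [2, 3, 4]
    }
}
```

The term-only version also marks the leading "y z" at positions 0 and 1, which never took part in a phrase match; the position-aware version does not.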
Mike McCandless

http://blog.mikemccandless.com

On Fri, Feb 14, 2014 at 7:05 PM, Steve Davids wrote:
> Hello,
>
> I have recently been given a requirement to improve document highlights within our system. Unfortunately, the current functionality gives more of a best-guess on what terms to highlight vs the actual terms to highlight that actually did perform the match. A couple examples of issues that were found:
>
> Nested boolean clause with a term that doesn't exist ANDed with a term that does highlights the ignored term in the query
> Text: a b c
> Logical Query: a OR (b AND z)
> Result: a b c
> Expected: a b c
> Nested span query doesn't maintain the proper positions and offsets
> Text: y z x y z a
> Logical Query: ("x y z", a) span near 10
> Result: y z x y z a
> Expected: y z x y z a
>
> I am currently using the Highlighter with a QueryScorer and a SimpleSpanFragmenter. While looking through the code it looks like the entire query structure is dropped in the WeightedSpanTermExtractor by just grabbing any positive TermQuery and flattening them all into a simple Map which is then passed on to highlight all of those terms. I believe this over simplification of term extraction is the crux of the issue and needs to be modified in order to produce more "exact" highlights.
>
> I was brainstorming with a colleague and thought perhaps we can spin up a MemoryIndex to index that one document and start performing a depth-first search of all queries within the overall Lucene query graph. At that point we can start querying the MemoryIndex for leaf queries and start walking back up the tree, pruning branches that don't result in a search hit which results in a map of actual matched query terms. This approach seems pretty painful but will hopefully produce better matches.
> I would like to see what the experts on the mailing list would have to say about this approach or is there a better way to retrieve the query terms & positions that produced the match? Or perhaps there is a different Highlighter implementation that should be used, though our user queries are extremely complex with a lot of nested queries of various types.
>
> Thanks,
>
> -Steve
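
The pruning idea in the quoted message (evaluate the leaf queries against the one document, then walk back up the tree and drop branches that did not match) can be sketched without Lucene. The Java below is a toy model under stated assumptions: Node, Term, And, Or and matchedTerms are hypothetical names, the document is reduced to a set of tokens, and a set lookup stands in for querying a MemoryIndex with each leaf query. Positions and span queries are ignored, so this covers only the boolean example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy query tree, NOT the Lucene Query classes.
final class PrunedTermsSketch {

    interface Node { }

    static final class Term implements Node {
        final String text;
        Term(String text) { this.text = text; }
    }

    static final class And implements Node {
        final List<Node> clauses;
        And(Node... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    static final class Or implements Node {
        final List<Node> clauses;
        Or(Node... clauses) { this.clauses = Arrays.asList(clauses); }
    }

    // Depth-first walk. Returns the terms that contributed to a match of this
    // sub-tree, or null when the sub-tree does not match and must be pruned.
    static Set<String> matchedTerms(Node query, Set<String> docTokens) {
        if (query instanceof Term) {
            String text = ((Term) query).text;
            if (!docTokens.contains(text)) {
                return null; // leaf misses: stand-in for a MemoryIndex lookup
            }
            Set<String> hit = new LinkedHashSet<>();
            hit.add(text);
            return hit;
        }
        if (query instanceof And) {
            Set<String> all = new LinkedHashSet<>();
            for (Node clause : ((And) query).clauses) {
                Set<String> sub = matchedTerms(clause, docTokens);
                if (sub == null) {
                    return null; // one failed clause prunes the whole AND
                }
                all.addAll(sub);
            }
            return all;
        }
        if (query instanceof Or) {
            Set<String> any = new LinkedHashSet<>();
            boolean matched = false;
            for (Node clause : ((Or) query).clauses) {
                Set<String> sub = matchedTerms(clause, docTokens);
                if (sub != null) { // keep only the branches that matched
                    matched = true;
                    any.addAll(sub);
                }
            }
            return matched ? any : null;
        }
        throw new IllegalArgumentException("unsupported node: " + query);
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("a", "b", "c"));
        Node query = new Or(new Term("a"), new And(new Term("b"), new Term("z")));
        System.out.println(matchedTerms(query, doc)); // prints [a]
    }
}
```

For the first example in the question (text a b c, query a OR (b AND z)), the sketch keeps only a: the b AND z branch is pruned because z is absent, so b is not reported even though it occurs in the text.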