lucene-java-user mailing list archives

From Steve Davids <>
Subject Re: Only highlight terms that caused a search hit/match
Date Wed, 19 Feb 2014 13:30:04 GMT
Thanks for the responses; I wanted to give everyone an update on this.

I did confirm the nested proximity problem. My apologies: the example I
gave actually works, though plenty of other cases are still problematic.
I created a new Jira ticket covering the issue and attached test cases
that demonstrate the problem. I believe it can be resolved with some smart
recursion in the WeightedSpanTermExtractor.extractWeightedSpanTerms method
that walks the SpanQuery tree.
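
To make the recursion idea concrete, here is a minimal sketch on a toy
span model. None of these are Lucene classes: SpanSketch, Span, SpanQ,
term, near and highlight are hypothetical names, and the slop check is a
simplified approximation of span-near semantics. Each match carries the
token positions of the leaf terms that took part in it, so a nested query
only highlights positions that belong to a complete match. It uses the
"y z x y z a" text from my first message purely as an illustration.

```java
import java.util.*;

class SpanSketch {
    // One match: [start, end) in token positions, plus the leaf positions that took part.
    static final class Span {
        final int start, end;
        final SortedSet<Integer> leaves;
        Span(int start, int end, SortedSet<Integer> leaves) {
            this.start = start; this.end = end; this.leaves = leaves;
        }
    }

    // Toy span query (hypothetical): yields every match in the token list.
    interface SpanQ { List<Span> spans(List<String> tokens); }

    static SpanQ term(String text) {
        return tokens -> {
            List<Span> out = new ArrayList<>();
            for (int i = 0; i < tokens.size(); i++) {
                if (tokens.get(i).equals(text)) {
                    out.add(new Span(i, i + 1, new TreeSet<>(Collections.singleton(i))));
                }
            }
            return out;
        };
    }

    // Simplified near: one span per clause, total gap between them at most `slop`.
    static SpanQ near(int slop, boolean inOrder, SpanQ... clauses) {
        return tokens -> {
            List<List<Span>> perClause = new ArrayList<>();
            for (SpanQ c : clauses) perClause.add(c.spans(tokens)); // recurse into children
            List<Span> out = new ArrayList<>();
            combine(perClause, 0, new ArrayDeque<>(), slop, inOrder, out);
            return out;
        };
    }

    private static void combine(List<List<Span>> perClause, int i, Deque<Span> chosen,
                                int slop, boolean inOrder, List<Span> out) {
        if (i == perClause.size()) {
            int start = Integer.MAX_VALUE, end = Integer.MIN_VALUE, covered = 0;
            SortedSet<Integer> leaves = new TreeSet<>();
            for (Span s : chosen) {
                start = Math.min(start, s.start);
                end = Math.max(end, s.end);
                covered += s.end - s.start;
                leaves.addAll(s.leaves);
            }
            if ((end - start) - covered <= slop) out.add(new Span(start, end, leaves));
            return;
        }
        for (Span s : perClause.get(i)) {
            // Ordered near: each span must begin at or after the end of the previous one.
            if (inOrder && !chosen.isEmpty() && s.start < chosen.peekLast().end) continue;
            chosen.addLast(s);
            combine(perClause, i + 1, chosen, slop, inOrder, out);
            chosen.removeLast();
        }
    }

    // Wrap only the tokens whose positions took part in a complete match.
    static String highlight(List<String> tokens, SpanQ q) {
        Set<Integer> hit = new HashSet<>();
        for (Span s : q.spans(tokens)) hit.addAll(s.leaves);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) sb.append(' ');
            String t = tokens.get(i);
            sb.append(hit.contains(i) ? "<b>" + t + "</b>" : t);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("y z x y z a".split(" "));
        SpanQ q = near(10, false, near(0, true, term("x"), term("y"), term("z")), term("a"));
        System.out.println(highlight(tokens, q)); // y z <b>x</b> <b>y</b> <b>z</b> <b>a</b>
    }
}
```

The point of the design is that positions travel up the tree with each
match instead of being flattened into a term map; the brute-force
combination step is only acceptable because this is a toy.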

As for omitting terms that don't produce a search hit: I tried trimming the
query tree by putting a simple guard statement in the
WeightedSpanTermExtractor.extract method that executes a search against the
in-memory index for the passed query. Since extract is called recursively,
the branches of the query tree were trimmed as expected. This worked great
for the simple "a OR (b AND z)" case! However, the whole thing came to a
grinding halt when I realized that I still need to support other fielded
searches. When highlighting with the fielded search "foo:bar AND (a OR (b
AND z))" I received 0 highlights: the entire query was pruned because only
the "default" field is indexed, not "foo". Also, since the index that we
maintain comes from disparate data sources, it isn't easy for us to rebuild
an exact replica as a memory index for this to work completely. I have been
toying with the idea of turning any non-default fielded search into a
MatchAllDocsQuery, essentially giving a 1=1 for fielded searches that I
can't verify (still a best-guess highlight attempt, just a better guess).
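
A minimal sketch of the pruning guard and the "treat unverifiable fields as
1=1" idea, on a toy query model rather than real Lucene classes.
PruneSketch, Q, Term, Bool, hits, extract and the trustOtherFields flag are
all hypothetical names made up for illustration, and the document is
reduced to a set of default-field tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class PruneSketch {
    static final String DEFAULT_FIELD = "default"; // the only field indexed for highlighting

    // Toy query model (hypothetical; these are not Lucene classes).
    interface Q {}
    static final class Term implements Q {
        final String field, text;
        Term(String field, String text) { this.field = field; this.text = text; }
    }
    static final class Bool implements Q {
        final boolean and; // true = every clause required, false = any clause suffices
        final List<Q> clauses;
        Bool(boolean and, Q... clauses) { this.and = and; this.clauses = Arrays.asList(clauses); }
    }

    static Q term(String text) { return new Term(DEFAULT_FIELD, text); }
    static Q and(Q... clauses) { return new Bool(true, clauses); }
    static Q or(Q... clauses) { return new Bool(false, clauses); }

    // Does the query hit the single document whose default-field tokens are `doc`?
    // Terms on other fields cannot be checked; trustOtherFields decides their outcome.
    static boolean hits(Q q, Set<String> doc, boolean trustOtherFields) {
        if (q instanceof Term) {
            Term t = (Term) q;
            if (!t.field.equals(DEFAULT_FIELD)) return trustOtherFields;
            return doc.contains(t.text);
        }
        Bool b = (Bool) q;
        for (Q clause : b.clauses) {
            boolean hit = hits(clause, doc, trustOtherFields);
            if (b.and && !hit) return false;
            if (!b.and && hit) return true;
        }
        return b.and;
    }

    // Recursive extraction with the guard: a branch without a hit contributes no terms.
    static void extract(Q q, Set<String> doc, boolean trustOtherFields, Set<String> out) {
        if (!hits(q, doc, trustOtherFields)) return;
        if (q instanceof Term) {
            Term t = (Term) q;
            if (t.field.equals(DEFAULT_FIELD)) out.add(t.text); // only highlight the indexed field
            return;
        }
        for (Q clause : ((Bool) q).clauses) extract(clause, doc, trustOtherFields, out);
    }

    static Set<String> highlightTerms(Q q, Set<String> doc, boolean trustOtherFields) {
        Set<String> out = new TreeSet<>();
        extract(q, doc, trustOtherFields, out);
        return out;
    }

    public static void main(String[] args) {
        Set<String> doc = new TreeSet<>(Arrays.asList("a", "b", "c"));
        Q nested = or(term("a"), and(term("b"), term("z")));
        Q fielded = and(new Term("foo", "bar"), nested);
        System.out.println(highlightTerms(nested, doc, false));  // [a]
        System.out.println(highlightTerms(fielded, doc, false)); // []  (whole query pruned)
        System.out.println(highlightTerms(fielded, doc, true));  // [a] (foo:bar treated as 1=1)
    }
}
```

With trustOtherFields set, the result is still a best guess, since the
document may not actually satisfy foo:bar, but the default-field branches
are pruned the same way as in the unfielded case.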

Hopefully these thoughts trigger more discussion and recommendations, as I
am trying to mimic the Google cached results page, which shows the entire
original document and highlights all of the matching query terms (ideally
only those that produced the hit).



P.S. Here is the patch for pruning the query space (suggestions welcome for
making this work with non-default fielded queries):

@@ -70,6 +70,7 @@
   private boolean wrapToCaching = true;
   private int maxDocCharsToAnalyze;
   private AtomicReader internalReader = null;
+  private IndexSearcher internalSearcher = null;

   public WeightedSpanTermExtractor() {
@@ -91,6 +92,11 @@
    * @throws IOException If there is a low-level I/O error
   protected void extract(Query query, Map<String,WeightedSpanTerm> terms) throws IOException {
+    //Prune the search tree to only retrieve terms that produce a hit.
+    if(getSearcher().search(query, 1).totalHits == 0) {
+      return;
+    }
     if (query instanceof BooleanQuery) {
       BooleanClause[] queryClauses = ((BooleanQuery) query).getClauses();

@@ -366,6 +372,14 @@
     return internalReader.getContext();

+  protected IndexSearcher getSearcher() throws IOException {
+    if(internalSearcher == null) {
+      internalSearcher = new IndexSearcher(getLeafContext());
+    }
+    return internalSearcher;
+  }
    * This reader will just delegate every call to a single field in the
    * AtomicReader. This way we only need to build this field once rather

On Sun, Feb 16, 2014 at 1:35 PM, Rose, Stuart J <> wrote:

> Hi Steve,
> We leveraged the SpanQuery and Highlighting APIs in 3.5 a couple of years
> ago to do this. In order to get accurate doc hits for the types of phrases
> that we needed to support search on, we defined a phrase query syntax and
> then implemented a span query parser to create a nested structure of span
> operations that embody the query.
> The test output below gives the span structure that we generate and then
> the resulting highlights for each query.
>         spanOr([text:a, spanNear([text:b, text:z], 987654321, false)])
>         <B>a</B> b c
>         spanNear([spanNear([text:x, text:y, text:z], 0, true), text:a], 10, false)
>         y z <B>x</B> <B>y</B> <B>z</B> <B>a</B>
> I'll check to see if we can make it available as a starting point for what
> Mike is suggesting.
> In the meantime, I recommend verifying that each span query is created as
> intended, keeping in mind that doc hits may be 'valid', but might have
> matched for the wrong reason and therefore have mismatched highlighting.
> Stuart
> -----Original Message-----
> From: Michael McCandless []
> Sent: Saturday, February 15, 2014 2:54 AM
> To: Lucene Users
> Cc:
> Subject: Re: Only highlight terms that caused a search hit/match
> Unfortunately, all Lucene's highlighters are "approximate" in this
> regard: there is no guarantee that the shown snippets, if they were a
> single little document, would have matched the query.
> Even the newest highlighter, PostingsHighlighter, doesn't look at
> positions, e.g. a PhraseQuery highlight could be "wrong", though
> "typically" the snippets with all terms from the phrase will score higher
> and be more likely to be picked in practice.
> Net/net I think a "precise highlighter" would be a nice addition to
> Lucene, but it is a challenge because you need to turn every leaf query
> into a positional query, even queries like TermQuery that normally don't
> touch positions, and then you need to follow the query tree while you
> highlight so that in your first example a OR (b AND z), having picked a
> snippet or two for a, you then also go and pick a snippet or two for the b
> AND z clause, and then present them both together.
> It's a hard problem but it would make a great addition.
> Mike McCandless
> On Fri, Feb 14, 2014 at 7:05 PM, Steve Davids <> wrote:
> > Hello,
> >
> > I have recently been given a requirement to improve document highlights
> > within our system. Unfortunately, the current functionality gives more of a
> > best-guess on what terms to highlight vs the actual terms to highlight that
> > actually did perform the match. A couple examples of issues that were found:
> >
> > Nested boolean clause with a term that doesn't exist ANDed with a term
> > that does highlights the ignored term in the query
> > Text: a b c
> > Logical Query: a OR (b AND z)
> > Result: <b>a</b> <b>b</b> c
> > Expected: <b>a</b> b c
> > Nested span query doesn't maintain the proper positions and offsets
> > Text: y z x y z a
> > Logical Query: ("x y z", a) span near 10
> > Result: <b>y</b> <b>z</b> <b>x</b> <b>y</b> <b>z</b> <b>a</b>
> > Expected: y z <b>x</b> <b>y</b> <b>z</b> <b>a</b>
> >
> > I am currently using the Highlighter with a QueryScorer and a
> > SimpleSpanFragmenter. While looking through the code it looks like the
> > entire query structure is dropped in the WeightedSpanTermExtractor by just
> > grabbing any positive TermQuery and flattening them all into a simple Map
> > which is then passed on to highlight all of those terms. I believe this
> > over simplification of term extraction is the crux of the issue and needs
> > to be modified in order to produce more "exact" highlights.
> >
> > I was brainstorming with a colleague and thought perhaps we can spin up
> > a MemoryIndex to index that one document and start performing a depth-first
> > search of all queries within the overall Lucene query graph. At that point
> > we can start querying the MemoryIndex for leaf queries and start walking
> > back up the tree, pruning branches that don't result in a search hit which
> > results in a map of actual matched query terms. This approach seems pretty
> > painful but will hopefully produce better matches. I would like to see what
> > the experts on the mailing list would have to say about this approach or is
> > there a better way to retrieve the query terms & positions that produced
> > the match? Or perhaps there is a different Highlighter implementation that
> > should be used, though our user queries are extremely complex with a lot of
> > nested queries of various types.
> >
> > Thanks,
> >
> > -Steve