lucene-solr-user mailing list archives

From Jan Høydahl <jan....@cominvent.com>
Subject In-document highlighting DocValues?
Date Mon, 10 Oct 2011 14:19:48 GMT
Hi,

We index structured documents, with numbered chapters, paragraphs and sentences. After doing
a (rather complex) search, we may get multiple matches in each result doc. We want to highlight
those matches in our front-end and currently we do a simple string match of the query words
against the raw text.

However, this highlights some words that do not satisfy the original query, and it misses
words where the match was on a stem, synonym or wildcard. We thus need to improve this,
and my plan was to utilize DocValues (Payloads). Would the following work?

1. For each term in the field "text", index DocValues with info about chapter#,
   paragraph#, sentence# and word#. This can be done in our application code,
   e.g. "foo|1,2,3,4" for chapter 1, paragraph 2, sentence 3 and word 4.

2. Then, for a specific document in the result list, retrieve a list of all matches
   in field "text", and for each match, retrieve the associated DocValues.

3. The client application can now use this information to highlight matches, as well
   as "jump to next match" etc., and would highlight the correct words only, e.g. it
   would be able to highlight "colour" even if the match was on the synonym "color".
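
To make steps 1 and 3 concrete, here is a minimal Python sketch of the encoding and of the client-side highlighting. It is not Lucene or Solr code: the names `Position`, `encode_token`, `decode_token` and `highlight_sentence` are hypothetical, word numbers are assumed to be 1-based, and the set of matched positions is assumed to have been retrieved already (step 2, which is the open question).

```python
from typing import List, NamedTuple, Set, Tuple


class Position(NamedTuple):
    """Structural address of one word (hypothetical helper type)."""
    chapter: int
    paragraph: int
    sentence: int
    word: int


def encode_token(term: str, pos: Position) -> str:
    """Step 1: build the 'term|chapter,paragraph,sentence,word' form."""
    return f"{term}|{pos.chapter},{pos.paragraph},{pos.sentence},{pos.word}"


def decode_token(token: str) -> Tuple[str, Position]:
    """Split an encoded token back into the term and its position."""
    # rpartition splits on the last '|', so a term containing '|' survives.
    term, _, payload = token.rpartition("|")
    return term, Position(*(int(n) for n in payload.split(",")))


def highlight_sentence(words: List[str], chapter: int, paragraph: int,
                       sentence: int, matches: Set[Position]) -> str:
    """Step 3: wrap only the words whose position is in `matches`.

    Assumes word numbers start at 1, as in the "foo|1,2,3,4" example.
    """
    out = []
    for number, word in enumerate(words, start=1):
        if Position(chapter, paragraph, sentence, number) in matches:
            out.append(f"<em>{word}</em>")
        else:
            out.append(word)
    return " ".join(out)
```

Because the highlighting is driven by positions rather than by the query string, the word "colour" at a matched position gets marked even if the query term was "color".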

Another use case for this technique would be OCR applications where we store with each term
its x,y offsets for where it occurs in
the original TIFF image scan.

What is already in place, and what code needs to be written? I don't currently see how to
get a complete list of matches for a particular document.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

