lucene-solr-user mailing list archives

From Mike Sokolov <soko...@ifactory.com>
Subject Re: In-document highlighting DocValues?
Date Thu, 13 Oct 2011 14:23:36 GMT
Is there some reason you don't want to leverage Highlighter to do this 
work?  It has all the necessary code for using the analyzed version of 
your query so it will only match tokens that really contribute to the 
search match.
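
As a rough sketch of what such a request might look like against Solr's 
standard highlighting parameters (hl, hl.fl, hl.fragsize, hl.simple.pre 
and hl.simple.post): the base URL, the field name "text" and the helper 
name highlight_query_url are assumptions made for illustration, not 
anything taken from this thread.

```python
from urllib.parse import urlencode


def highlight_query_url(base_url, query, field):
    """Build a Solr select URL that asks for highlighting on one field.

    base_url and field are placeholders; adjust them to your own setup.
    """
    params = {
        "q": query,
        "hl": "true",              # turn highlighting on
        "hl.fl": field,            # field(s) to highlight
        "hl.fragsize": 0,          # 0 = no fragmenting, use the whole field value
        "hl.simple.pre": "<em>",   # marker inserted before each match
        "hl.simple.post": "</em>",  # marker inserted after each match
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr instance and field name.
url = highlight_query_url("http://localhost:8983/solr/select", "text:colour", "text")
print(url)
```

The highlighted field must be stored for this to return anything, and 
long documents may also need a larger hl.maxAnalyzedChars.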

You might also be interested in LUCENE-2878 (which is still under 
development on a branch though).  It aims to provide first-class access 
to payloads and positions during scoring, and this will be very useful 
for complex highlighting tasks.

Another possible solution to the OCR problem could be: generate an XML 
file with a tag for each word encoding its x,y coords, like 
<word x="3" y="10">This</word>; index that file using XmlCharFilter or 
HTMLStripCharFilter. Then when you search, use the Solr highlighter to 
highlight the entire document, and process it using XML tools to find 
the locations of the matches.
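
A minimal sketch of the two ends of that idea, assuming the highlighter 
wraps each match in <em>...</em> and that the marker lands inside the 
corresponding <word> element. The <page> wrapper element and the function 
names words_to_xml and match_locations are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def words_to_xml(words):
    """Serialize (text, x, y) tuples as one <word> element per word."""
    body = " ".join(
        '<word x="%d" y="%d">%s</word>' % (x, y, escape(text))
        for text, x, y in words
    )
    # <page> is a hypothetical root element so the output is well-formed XML.
    return "<page>" + body + "</page>"


def match_locations(highlighted_xml, tag="em"):
    """Return (text, x, y) for every <word> that contains a highlight tag."""
    root = ET.fromstring(highlighted_xml)
    hits = []
    for word in root.iter("word"):
        if word.find(".//" + tag) is not None:
            text = "".join(word.itertext())
            hits.append((text, int(word.get("x")), int(word.get("y"))))
    return hits


# Document to index: two words with their coordinates.
xml_doc = words_to_xml([("This", 3, 10), ("colour", 40, 10)])

# Stand-in for what the highlighter might return for a match on "colour".
highlighted = xml_doc.replace(">colour<", "><em>colour</em><")
print(match_locations(highlighted))
```

If the highlighter output is not well-formed XML (for example when a 
marker straddles a tag boundary), a more tolerant parser would be needed.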

-Mike

On 10/10/2011 10:19 AM, Jan Høydahl wrote:
> Hi,
>
> We index structured documents, with numbered chapters, paragraphs and
> sentences. After doing a (rather complex) search, we may get multiple
> matches in each result doc. We want to highlight those matches in our
> front-end and currently we do a simple string match of the query words
> against the raw text.
>
> However, this highlights some words that do not satisfy the original
> query, and also does not highlight other words where the match was in a
> stem, or synonym or wildcard. We thus need to improve this, and my plan
> was to utilize DocValues (Payloads). Would the following work?
>
> 1. For each term in the field "text", index DocValues with info about
>    chapter#, paragraph#, sentence# and word#.
>    This can be done in our application code, e.g. "foo|1,2,3,4" for
>    chapter 1, paragraph 2, sentence 3 and word 4.
>
> 2. Then, for a specific document in the result list, retrieve a list of
>    all matches in field "text", and for each match, retrieve the
>    associated DocValues.
>
> 3. The client application can now use this information to highlight
>    matches, as well as "jump to next match" etc, and would highlight the
>    correct words only, e.g. it would be able to highlight "colour" even
>    if the match was on the synonym "color".
>
> Another use case for this technique would be OCR applications where we
> store with each term its x,y offsets for where it occurs in the original
> TIFF image scan.
>
> What is already in place and what code needs to be written? I don't
> currently see how to get a complete list of matches for a particular
> document.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
>    
