lucene-solr-user mailing list archives

From Jan Høydahl <jan....@cominvent.com>
Subject Re: In-document highlighting DocValues?
Date Tue, 11 Oct 2011 21:45:00 GMT
Hi,

Looking more at the new DocValues for 4.0, they are only per-document, right?

So I guess what I'm thinking is to use the good old Payloads per term position to store this info.
Since a payload is a single byte[] value, we could pack chapter, paragraph, sentence and word numbers into it somehow.
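
To make the encoding idea concrete, here is a minimal sketch, assuming the four numbers fit in plain ints and a fixed 16-byte layout is acceptable. PositionPayload is a hypothetical helper name, not a Lucene class; attaching the bytes to a token at index time and reading them back at search time are not shown.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper (not part of Lucene): packs a term's structural
// position into a fixed-size byte[] suitable for use as a payload.
final class PositionPayload {
    // Four big-endian ints: chapter, paragraph, sentence, word.
    static final int SIZE = 4 * Integer.BYTES;

    static byte[] encode(int chapter, int paragraph, int sentence, int word) {
        return ByteBuffer.allocate(SIZE)
                .putInt(chapter)
                .putInt(paragraph)
                .putInt(sentence)
                .putInt(word)
                .array();
    }

    // Returns {chapter, paragraph, sentence, word}.
    static int[] decode(byte[] payload) {
        if (payload.length != SIZE) {
            throw new IllegalArgumentException("expected " + SIZE + " bytes, got " + payload.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt() };
    }

    public static void main(String[] args) {
        byte[] payload = encode(1, 2, 3, 4);
        // prints "16 bytes -> [1, 2, 3, 4]"
        System.out.println(payload.length + " bytes -> " + Arrays.toString(decode(payload)));
    }
}
```

A variable-length encoding would make the payloads smaller, but the fixed layout keeps decoding trivial for a first experiment.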

But the crucial point here is how to iterate through every single matching term in a field
and pull out the payloads?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 10 Oct 2011, at 16:19, Jan Høydahl wrote:

> Hi,
> 
> We index structured documents, with numbered chapters, paragraphs and sentences.
> After doing a (rather complex) search, we may get multiple matches in each result
> doc. We want to highlight those matches in our front-end and currently we do a
> simple string match of the query words against the raw text.
> 
> However, this highlights some words that do not satisfy the original query, and
> also does not highlight other words where the match was in a stem, or synonym or
> wildcard. We thus need to improve this, and my plan was to utilize DocValues
> (Payloads). Would the following work?
> 
> 1. For each term in the field "text", index DocValues with info about chapter#,
>   paragraph#, sentence# and word#.
>   This can be done in our application code, e.g. "foo|1,2,3,4" for chapter 1,
>   paragraph 2, sentence 3 and word 4.
> 
> 2. Then, for a specific document in the result list, retrieve a list of all
>   matches in field "text", and for each match,
>   retrieve the associated DocValues.
> 
> 3. The client application can now use this information to highlight matches, as
>   well as "jump to next match" etc,
>   and would highlight the correct words only, e.g. it would be able to highlight
>   "colour" even if the match was on the synonym "color".
> 
> Another use case for this technique would be OCR applications where we store with
> each term its x,y offsets for where it occurs in the original TIFF image scan.
> 
> What is already in place and what code needs to be written? I don't currently see
> how to get a complete list of matches for a particular document.
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
> 

