lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1954) Highlighter component should expose snippet character offsets and the score.
Date Fri, 18 Jun 2010 21:49:23 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880338#action_12880338 ]

Hoss Man commented on SOLR-1954:
--------------------------------

bq. That way the highlighting section remains untouched, with extra stuff in a 'highlighting-extended-info'

that really seems painful -- i think it would be a lot better to just come up with what the
"new" structure should look like that's more flexible, populate it with more/less data based
on what param the user asks for (ie: hl.positions=true), and then make this new structure the
default for all future versions of solr.  Folks who don't want the new types of metadata,
and don't want to change their clients to understand the new structure, can add some param
to their defaults to revert the format.  this is how we've dealt with several other changes
in the past where we want the "default" behavior to be different for new users, but still
support the old behavior for legacy users.

(spellcheck.extendedResults may seem painful because it changes results -- but that's because
it was never intended for you to toggle it on different requests -- it's expected that you'll
set it once and forget it -- the real problem is that it probably should have been made the
default)

bq. The problem with offsets is.... what are the units? utf8 bytes, utf16 units, real characters?


1) isn't highlighting fairly fundamentally character based?  would you ever want/expect a
highlight position to be based on bytes that break up a logical character?
2) being largely ignorant of highlighting, i would say the units should be whatever the
Highlighter currently uses when indexing into string values -- my understanding is that it's
the same as the start/end offsets in tokens, so if they are chars then it's chars, if they are
bytes, then it's bytes.
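
To make the units question concrete, here is a minimal Java sketch (not Solr code; the class and method names are made up for illustration) that measures the same prefix in UTF-8 bytes, UTF-16 code units, and Unicode code points.  As far as i know, Lucene token offsets are Java char indexes, ie: UTF-16 code units, but treat that as an assumption to verify against the analyzer actually in use.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or Lucene.
final class OffsetUnits {
    // Length of s when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length of s in Java chars (UTF-16 code units).
    static int utf16Units(String s) {
        return s.length();
    }

    // Length of s in Unicode code points ("real characters", roughly).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+00E9 takes 2 bytes in UTF-8; U+1D11E is a surrogate pair in UTF-16.
        String prefix = "caf\u00e9 \uD834\uDD1E ";
        String text = prefix + "SDRAM";

        // The start offset of "SDRAM" is the length of everything before it,
        // and that length depends on the unit chosen.
        System.out.println("utf8 bytes:  " + utf8Bytes(prefix));   // 11
        System.out.println("utf16 units: " + utf16Units(prefix));  // 8
        System.out.println("code points: " + codePoints(prefix));  // 7
        System.out.println("indexOf:     " + text.indexOf("SDRAM")); // 8, same as utf16 units
    }
}
```

A client written in a language whose strings are not UTF-16 would have to convert such offsets before slicing its own copy of the text.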

bq. Walter Underwood proposed a good idea of just alternating segments of text for highlighting.

I like that idea, and if structured properly it can still include the "score" for each matching
chunk as metadata, but some clients are still going to prefer offset metadata -- in particular
the situation where i've got a 20MB text file in external storage and i want to display the entire
document with matches highlighted.  returning alternating strings isn't really going
to help me unless they aren't truncated - at which point you are returning the entire 20MB
doc (broken up into a bunch of distinct strings) instead of just returning a bunch of numbers
i can use to find the corresponding points in my local copy of the file.
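
The external-storage case can be sketched roughly as follows, assuming the response carried sorted, non-overlapping [start, end) offsets in UTF-16 code units.  The class, method, and array layout are hypothetical, not anything taken from the attached patch.

```java
// Hypothetical client-side helper, not part of Solr.
final class OffsetHighlighter {
    /**
     * Wraps each [start, end) span of text in pre/post markers.
     * Assumes spans are sorted by start, do not overlap, and use
     * Java char (UTF-16 code unit) offsets with an exclusive end.
     */
    static String highlight(String text, int[][] spans, String pre, String post) {
        StringBuilder out = new StringBuilder(text.length());
        int cursor = 0; // first char of text not yet copied to out
        for (int[] span : spans) {
            int start = span[0];
            int end = span[1];
            if (start < cursor || end < start || end > text.length()) {
                throw new IllegalArgumentException("bad span " + start + "-" + end);
            }
            out.append(text, cursor, start);                        // unmatched text before the span
            out.append(pre).append(text, start, end).append(post);  // the match itself
            cursor = end;
        }
        out.append(text, cursor, text.length());                    // tail after the last span
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the local copy of a large externally stored document.
        String local = "CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered";
        int start = local.indexOf("SDRAM");
        int[][] spans = { { start, start + "SDRAM".length() } };
        System.out.println(highlight(local, spans, "<em>", "</em>"));
    }
}
```

With offsets, the server only has to ship a few integers per match, and the client does the splicing against its own copy of the text.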

> Highlighter component should expose snippet character offsets and the score.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1954
>                 URL: https://issues.apache.org/jira/browse/SOLR-1954
>             Project: Solr
>          Issue Type: New Feature
>          Components: highlighter
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: SOLR-1954_start_and_end_offsets.patch
>
>
> The Highlighter Component does not currently expose the snippet character offsets nor
the score.  There is a TODO in DefaultSolrHighlighter indicating the intention to add this
eventually.  This information is needed when doing highlighting on external content.  The
data is there so it's pretty easy to output it in some way.  The challenge is deciding on the
output and its ramifications on backwards compatibility.  The current highlighter component
response structure doesn't lend itself to adding any new data, unfortunately.  I wish the
original implementer had some foresight.  Unfortunately all the highlighting tests assume
this structure.  Here is a snippet of the current response structure in Solr's sample data
searching for "sdram" for reference:
> {code:xml}
> <lst name="highlighting">
>  <lst name="VS1GB400C3">
>   <arr name="text">
>    <str>CORSAIR ValueSelect 1GB 184-Pin DDR &lt;em&gt;SDRAM&lt;/em&gt; Unbuffered DDR 400 (PC 3200) System Memory - Retail</str>
>   </arr>
>  </lst>
> </lst>
> {code}
> Perhaps as a little hack, we introduce a pseudo field called text_startCharOffset which
is the concatenation of the matching field and "_startCharOffset".  This would be an array
of ints.  Likewise, there would be another array for endCharOffset and score.
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

