Message-ID: <15891547.86531276898303015.JavaMail.jira@thor>
Date: Fri, 18 Jun 2010 17:58:23 -0400 (EDT)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Subject: [jira] Commented: (SOLR-1954) Highlighter component should expose snippet character offsets and the score.
In-Reply-To: <11713462.15421276633045310.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

    [ https://issues.apache.org/jira/browse/SOLR-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880340#action_12880340 ]

Robert Muir commented on SOLR-1954:
-----------------------------------

{quote}
1) isn't highlighting fairly fundamentally character based? would you ever want/expect a highlight position to be based on bytes that break up a logical character?

2) being largely ignorant of highlighting, i would say the units should be in whatever the Highlighter currently uses when indexing into string values - my understanding is that it's the same as the start/end offsets in tokens, so if they are chars then it's chars, if they are bytes, then it's bytes.
{quote}

Nope, a 'character' (char) in Java is a UTF-16 code unit, so it cannot always hold a full Unicode code point. In other programming languages that might be Solr clients, characters and strings might be UTF-8 or UTF-32.

So if offsets are to be returned, it's necessary to specify what 'unit' they are measured in. Otherwise, an offset is as useless as saying my house is '4' away from yours... 4 what?!

> Highlighter component should expose snippet character offsets and the score.
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-1954
>                 URL: https://issues.apache.org/jira/browse/SOLR-1954
>             Project: Solr
>          Issue Type: New Feature
>          Components: highlighter
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: SOLR-1954_start_and_end_offsets.patch
>
>
> The Highlighter Component does not currently expose the snippet character offsets nor the score. There is a TODO in DefaultSolrHighlighter indicating the intention to add this eventually.
> This information is needed when doing highlighting on external content. The data is there, so it's pretty easy to output it in some way. The challenge is deciding on the output and its ramifications on backwards compatibility. The current highlighter component response structure doesn't lend itself to adding any new data, unfortunately. I wish the original implementer had some foresight. Unfortunately all the highlighting tests assume this structure. Here is a snippet of the current response structure in Solr's sample data searching for "sdram" for reference:
> {code:xml}
>
>
>
> CORSAIR ValueSelect 1GB 184-Pin DDR <em>SDRAM</em> Unbuffered DDR 400 (PC 3200) System Memory - Retail
>
>
>
> {code}
> Perhaps as a little hack, we introduce a pseudo field called text_startCharOffset which is the concatenation of the matching field and "_startCharOffset". This would be an array of ints. Likewise, there would be another array for endCharOffset and score.
> Thoughts?
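
The point in Robert Muir's comment, that an offset means nothing until its unit is stated, can be illustrated with a small Java sketch. It measures the position of one match in three units: UTF-16 code units (Java chars), Unicode code points, and UTF-8 bytes. This is only an illustration under stated assumptions: it does not use any Solr or Lucene API, and the class OffsetUnits, its method names, and the sample text are hypothetical, not taken from the issue or the attached patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Solr or the SOLR-1954 patch.
class OffsetUnits {

    // Offset of the first occurrence of term, in UTF-16 code units (Java chars).
    public static int utf16Offset(String text, String term) {
        return text.indexOf(term);
    }

    // The same position counted in Unicode code points (what a UTF-32 client would count).
    public static int codePointOffset(String text, int utf16Offset) {
        return text.codePointCount(0, utf16Offset);
    }

    // The same position counted in UTF-8 bytes.
    public static int utf8Offset(String text, int utf16Offset) {
        return text.substring(0, utf16Offset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the BMP: 2 chars in UTF-16, 4 bytes in UTF-8.
        // U+00E9 (e with acute accent) is 1 char in UTF-16 but 2 bytes in UTF-8.
        String text = "\uD834\uDD1E caf\u00E9 sdram";

        int u16 = utf16Offset(text, "sdram");
        System.out.println("UTF-16 code units: " + u16);                        // 8
        System.out.println("code points:       " + codePointOffset(text, u16)); // 7
        System.out.println("UTF-8 bytes:       " + utf8Offset(text, u16));      // 11
    }
}
```

The same match starts at 8, 7, or 11 depending on the unit, so a client that assumes the wrong unit would highlight the wrong span of the external content.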