From: "Anders Melchiorsen (JIRA)"
To: solr-dev@lucene.apache.org
Reply-To: solr-dev@lucene.apache.org
Date: Fri, 16 Oct 2009 15:19:31 -0700 (PDT)
Subject: [jira] Commented: (SOLR-1394) HTML stripper is splitting tokens
Message-ID: <1899310234.1255731571285.JavaMail.jira@brutus>
In-Reply-To: <1592575294.1251555932757.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766735#action_12766735 ]

Anders Melchiorsen commented on SOLR-1394:
------------------------------------------

Thanks, that sounds great.

There is an existing off-by-one error in the numWhitespace calculation for hexadecimal numeric entities. I noticed it while reworking the patch, but did not bother to report it here because I was annoyed at being ignored. Now you have put me in a better mood, so I can fix that error if you like?

> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>         Attachments: SOLR-1394.patch, SOLR-1394.patch
>
>
> The Solr HTML stripper replaces any removed HTML with whitespace. This is done to keep offsets correct for highlighting.
> However, as was already pointed out in SOLR-42, this means that any token containing an HTML entity will be split into several tokens. That makes the HTML stripper completely unreliable for international text (and any text is potentially international).
> The current code is actually deficient for BOTH highlighting and indexing, whereas the previous incarnation (which did not insert spaces) only had problems with highlighting.
> The only workaround is to not use entities at all, which is impossible in some situations and inconvenient in most. If the client is required to transform entities before handing the text to Solr, it might as well be required to strip tags too, and then the HTML stripper would not be needed at all.
> Today, we have a better solution available: offset correction. We can then avoid inserting extra whitespace, but still get correct offsets. The attached patch implements just that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
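
For readers of the archive, the offset correction idea in the quoted description can be sketched roughly as follows. This is not the attached patch and not Solr's stripper: the class EntityStripSketch, its entity table, and the 10-character entity length limit are all hypothetical, and only entities (not tags) are handled. Each entity is replaced by the single character it denotes, with no padding, and the stripper remembers how far the output has fallen behind the input so that token offsets can be mapped back for highlighting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of entity decoding with offset correction. */
final class EntityStripSketch {
    // Tiny illustrative table (assumption); a real stripper knows many more names.
    private static final Map<String, Integer> NAMED = Map.of(
        "amp", 0x26, "lt", 0x3C, "gt", 0x3E, "oslash", 0xF8, "aring", 0xE5);

    private final StringBuilder out = new StringBuilder();
    // From output position corrPos[k] onward, add corrDelta[k] to get the input offset.
    private final List<Integer> corrPos = new ArrayList<>();
    private final List<Integer> corrDelta = new ArrayList<>();

    EntityStripSketch(String input) {
        int i = 0;
        int delta = 0; // cumulative number of input chars dropped so far
        while (i < input.length()) {
            char c = input.charAt(i);
            int semi = (c == '&') ? input.indexOf(';', i + 1) : -1;
            // The length limit of 10 is an arbitrary choice for this sketch.
            int cp = (semi > 0 && semi - i <= 10)
                ? decode(input.substring(i + 1, semi)) : -1;
            if (!Character.isValidCodePoint(cp)) {
                out.append(c); // not a recognized entity: copy the char through
                i++;
                continue;
            }
            int before = out.length();
            out.appendCodePoint(cp); // no whitespace padding is inserted
            delta += (semi + 1 - i) - (out.length() - before);
            corrPos.add(out.length());
            corrDelta.add(delta);
            i = semi + 1;
        }
    }

    /** Returns the code point for an entity name (without & and ;), or -1. */
    private static int decode(String name) {
        try {
            if (name.startsWith("#x") || name.startsWith("#X")) {
                return Integer.parseInt(name.substring(2), 16);
            }
            if (name.startsWith("#")) {
                return Integer.parseInt(name.substring(1));
            }
        } catch (NumberFormatException e) {
            return -1;
        }
        Integer cp = NAMED.get(name);
        return cp == null ? -1 : cp;
    }

    String text() {
        return out.toString();
    }

    /** Maps an offset in the stripped text back to an offset in the input. */
    int correctOffset(int outOff) {
        int delta = 0;
        for (int k = 0; k < corrPos.size() && corrPos.get(k) <= outOff; k++) {
            delta = corrDelta.get(k);
        }
        return outOff + delta;
    }

    public static void main(String[] args) {
        EntityStripSketch s = new EntityStripSketch("S&oslash;ren b&#xF8;r");
        System.out.println(s.text());
        System.out.println(s.correctOffset(0) + ".." + s.correctOffset(5));
    }
}
```

With the input "S&oslash;ren b&#xF8;r", the stripped text contains two whole words, so a whitespace tokenizer sees two tokens rather than fragments. The first token spans 0..5 in the stripped text, which correctOffset maps back to 0..12 in the input, i.e. the full "S&oslash;ren" including the entity. The linear scan in correctOffset is for brevity only; a sorted lookup would be the obvious refinement.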