lucene-solr-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1394) HTML stripper is splitting tokens
Date Fri, 16 Oct 2009 22:00:31 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766727#action_12766727 ]

Yonik Seeley commented on SOLR-1394:
------------------------------------

I've been testing this with a bunch of different HTML. I don't see any places where this
is worse, and it prevents tokens from being split when they shouldn't be.
Given that the splitting is clearly a bug, and that changes to this filter won't affect the
rest of Solr, I plan on committing this shortly.

Things still aren't perfect as far as offsets and highlighting go, but this patch makes them
no worse.

I modified the solr.xml document to escape the '&' and then added the strip char filter
to the text field.
The query was: héllo OR hello OR unicode
Before this patch: Good <em>unicode</em> support: h&#xE9;llo <em>(hell</em>o with an accent over the e)
After this patch: Good <em>unicode</em> support: <em>h&#xE9;ll</em>o <em>(hell</em>o with é accent over the e)

> HTML stripper is splitting tokens
> ---------------------------------
>
>                 Key: SOLR-1394
>                 URL: https://issues.apache.org/jira/browse/SOLR-1394
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>         Attachments: SOLR-1394.patch, SOLR-1394.patch
>
>
> The Solr HTML stripper is replacing any removed HTML with whitespace. This is to keep
> offsets correct for highlighting.
> However, as was already pointed out in SOLR-42, this means that any token containing
> an HTML entity will be split into several tokens. That makes the HTML stripper completely
> unreliable for international text (and any text is potentially international).
> The current code is actually deficient for BOTH highlighting and indexing, where the
> previous incarnation (that did not insert spaces) only had problems with highlighting.
> The only workaround is to not use entities at all, which is impossible in some situations
> and inconvenient in most situations. If the client is required to transform entities before
> handing it to Solr, it might as well be required to also strip tags, and then the HTML stripper
> would not be needed at all.
> Today, we have a better solution that can be used: offset correction. We can then avoid
> inserting extra whitespace, but still get correct offsets. The attached patch implements just
> that.
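
For readers who want to see the difference between the two approaches in the quoted
description, here is a minimal Python sketch. It is not the code from the attached patch
and not Solr's implementation. It handles only character entities (no tags), and every
name in it (strip_padded, strip_with_corrections, correct_offset) is hypothetical. The
padding behaviour in strip_padded is an assumption about how "replacing removed HTML with
whitespace" plays out for an entity.

```python
import bisect
import html
import re

# Numeric or named character entities, e.g. &#xE9; or &eacute;
ENTITY = re.compile(r"&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def strip_padded(text):
    """Assumed old behaviour: decode the entity, then pad with spaces so
    the output has the same length as the input. Offsets stay aligned,
    but the padding lands inside the word."""
    def repl(match):
        decoded = html.unescape(match.group(0))
        return decoded + " " * (len(match.group(0)) - len(decoded))
    return ENTITY.sub(repl, text)


def strip_with_corrections(text):
    """Offset correction: emit no padding, and record how far the output
    has drifted from the input after each replaced entity."""
    out = []
    starts = [0]  # output offsets at which the cumulative drift changes
    diffs = [0]   # input length minus output length from that offset on
    pos = out_len = diff = 0
    for match in ENTITY.finditer(text):
        chunk = text[pos:match.start()]
        decoded = html.unescape(match.group(0))
        out.append(chunk)
        out.append(decoded)
        out_len += len(chunk) + len(decoded)
        diff += len(match.group(0)) - len(decoded)
        starts.append(out_len)
        diffs.append(diff)
        pos = match.end()
    out.append(text[pos:])
    return "".join(out), starts, diffs


def correct_offset(offset, starts, diffs):
    """Map an offset in the stripped text back to the original text."""
    i = bisect.bisect_right(starts, offset) - 1
    return offset + diffs[i]


if __name__ == "__main__":
    original = "Good unicode support: h&#xE9;llo"
    print(strip_padded(original).split())   # the word is split in two
    stripped, starts, diffs = strip_with_corrections(original)
    start = stripped.index("héllo")
    end = start + len("héllo")
    # The corrected span covers the whole entity-bearing word in the input.
    print(original[correct_offset(start, starts, diffs):
                   correct_offset(end, starts, diffs)])
```

With a whitespace tokenizer, the padded output yields two tokens for the accented word,
while the corrected output yields one token whose offsets still map back onto the full
"h&#xE9;llo" span in the original text, which is what a highlighter needs.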

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

