lucene-solr-dev mailing list archives

From Anders Melchiorsen <m...@spoon.kalibalik.dk>
Subject Re: HTML decoder is splitting tokens
Date Sat, 29 Aug 2009 14:39:19 GMT
Koji Sekiguchi <koji@r.email.ne.jp> writes:

> This is correct when you have the mapping definition:
>
> "&lt;" => "<"
> "&gt;" => ">"
>    :              :
>
> But I thought you could not have them, but have only:
>
> "&uuml;" => "ü"
> "&auml;" => "ä"
>    :             :
>
> Didn't it solve your problem?

Hi Koji,

Oh, it seems I missed part of your suggestion. So you propose to
have mappings for all entities except the troublesome lt, gt, and amp?

That should work, as long as it is okay that whitespace follows those
characters. I guess that it will indeed be okay for most situations.
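
Such a mapping file could be generated rather than typed by hand. The
following is only a rough sketch in Python, assuming the mapping file
takes one `"source" => "target"` pair per line as in the quoted example,
and assuming a double quote in the target can be backslash-escaped. The
function name build_mapping_lines is made up for illustration.

```python
from html.entities import name2codepoint

# Entities deliberately left unmapped, per the workaround discussed above.
EXCLUDED = {"lt", "gt", "amp"}


def build_mapping_lines(excluded=EXCLUDED):
    """Return one '"&name;" => "char"' line per named HTML entity,
    skipping the names listed in `excluded`."""
    lines = []
    for name, codepoint in sorted(name2codepoint.items()):
        if name in excluded:
            continue
        target = chr(codepoint)
        # Assumption: the mapping file parser accepts backslash escapes,
        # so a literal backslash or double quote is escaped here.
        target = target.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('"&%s;" => "%s"' % (name, target))
    return lines


if __name__ == "__main__":
    # Print the mapping lines; redirect to a file to use them.
    print("\n".join(build_mapping_lines()))
```

The output only covers the named entities that Python's html.entities
module knows about, so numeric references such as &#252; would still
need separate handling.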

Still, while that is a clever workaround, it doesn't change the fact that
the advertised functionality in the HTML stripper is broken.


I have now signed up for JIRA and created SOLR-1394 for this issue.


Thanks,
Anders.
