lucene-solr-dev mailing list archives

From Koji Sekiguchi <k...@r.email.ne.jp>
Subject Re: HTML decoder is splitting tokens
Date Sat, 29 Aug 2009 10:02:03 GMT
Anders,

Thank you for the explanation.

 > which could be written in HTML like this:
 >
 > use <tt>&lt;p&gt;</tt> to mark a paragraph

Ok.

 > so the mapping char filter would map it into:
 >
 > use <tt><p></tt> to mark a paragraph

This is correct when you have the mapping definition:

"&lt;" => "<"
"&gt;" => ">"
    :              :

But I thought you would not define those, and would have only:

"&uuml;" => "ü"
"&auml;" => "ä"
    :             :

Wouldn't that solve your problem?

Thank you,

Koji

Anders Melchiorsen wrote:
> Koji Sekiguchi <koji@r.email.ne.jp> writes:
>
>   
>> Thank you for attaching the patch. Sorry again, I don't have enough
>> time to investigate the patch and the problem you have. I would
>> just recommend that you open a JIRA issue and attach the
>> patch so that I or someone else can look into it later.
>>     
>
> Sorry, learning an issue tracker every time I find a bug in some
> project is too much trouble. I wouldn't mind if someone else transfers
> my previous mail, though.
>
>
>   
>> And I didn't understand this part of your previous mail:
>>
>>     
>>> Adding MappingCharFilterFactory in front of the HTML stripper (so
>>> that the latter will not see the entity) does work as expected.
>>> That is, until I try strings like "use &lt;p&gt; to mark a
>>> paragraph", where the HTML stripper will then remove parts of the
>>> actual text. So this approach will not work.
>>>       
>
> Entity mapping and tag removal have to happen in one pass to preserve
> fidelity.
>
> Let's say that we are analyzing a tutorial on writing HTML. It might
> contain the text:
>
>     use <p> to mark a paragraph
>
> which could be written in HTML like this:
>
>     use <tt>&lt;p&gt;</tt> to mark a paragraph
>
> so the mapping char filter would map it into:
>
>     use <tt><p></tt> to mark a paragraph
>
> which is already wrong. Next, the HTML stripper would remove the tags:
>
>     use to mark a paragraph
>
> and we have now lost a part of the original text.
>
>
> Cheers,
> Anders.
>
>   
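
For reference, the two mapping configurations discussed in this thread can be compared with a small stand-alone sketch. This is only an illustration under stated assumptions, not the Solr code: applyMapping and stripHtml are hypothetical stand-ins for a mapping char filter and an HTML stripper, the stripper is modeled as a naive regex that deletes tags and then decodes a few basic entities, and the class name is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CharFilterOrderDemo {

    // Pass 1 (hypothetical stand-in for a mapping char filter):
    // replace every occurrence of each key with its value.
    static String applyMapping(String text, Map<String, String> mapping) {
        String out = text;
        for (Map.Entry<String, String> e : mapping.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // Pass 2 (hypothetical stand-in for an HTML stripper):
    // delete anything that looks like a tag, then decode basic entities.
    static String stripHtml(String text) {
        String noTags = text.replaceAll("<[^>]*>", "");
        return noTags.replace("&lt;", "<")
                     .replace("&gt;", ">")
                     .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String html = "use <tt>&lt;p&gt;</tt> to mark a paragraph";

        // Mapping that includes the markup entities: the decoded "<p>"
        // looks like a real tag to the second pass and is deleted.
        Map<String, String> full = new LinkedHashMap<>();
        full.put("&lt;", "<");
        full.put("&gt;", ">");

        // Mapping restricted to letter entities (u-umlaut, a-umlaut):
        // "&lt;p&gt;" reaches the stripper untouched and is decoded there.
        Map<String, String> restricted = new LinkedHashMap<>();
        restricted.put("&uuml;", "\u00fc");
        restricted.put("&auml;", "\u00e4");

        System.out.println(stripHtml(applyMapping(html, full)));
        System.out.println(stripHtml(applyMapping(html, restricted)));
    }
}
```

With the full mapping the "<p>" from the original text is lost, as Anders describes. With the restricted mapping it survives, because the markup entities are never decoded before tag removal.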

