lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: HTML parser??
Date Tue, 03 May 2005 12:28:01 GMT

On May 3, 2005, at 4:35 AM, Bartosch Warzecha wrote:

>
> Hello,
>
> I´m building a search engine for HTML-Dokuments, and I´ve got a  
> HTML-parsing
> problem.
>
> This documents are in german. In this documents are different special
> characters, and different ways of writing this special characters,  
> like "ö",
> "&ouml;" and "&#246". Do somebody know a parsing engine that has no  
> problems
> with all this different ways to write this special characters?

What HTML parser are you using?  Those entity references should not  
be seen by your code once resolved by a parser.  Try NekoHTML.

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message