lucene-java-user mailing list archives

From "Mark Benussi" <mark_benu...@hotmail.com>
Subject RE: best html parser for html documents generated by microsoft products
Date Sun, 04 Dec 2005 00:22:09 GMT
I also use JTidy, but not for Lucene parsing. There is no easy way of
handling this; you simply have to remove all the crappy Microsoft inserts
as they come.
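
As a rough illustration of what removing those inserts can look like, here is a minimal sketch using only java.util.regex. The class name OfficeHtmlCleaner and the three patterns are assumptions made for this example, not part of JTidy or Lucene, and they cover only the most common Office residue: conditional comments, namespace-prefixed tags such as <o:p>, and xmlns attributes. Regex-based cleanup is fragile, so the output should still be passed through a real HTML parser before indexing.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips common Microsoft Office markup from an HTML string.
final class OfficeHtmlCleaner {

    // Conditional comments, e.g. <!--[if gte mso 9]> ... <![endif]-->
    private static final Pattern CONDITIONAL_COMMENT = Pattern.compile(
            "<!--\\[if[^\\]]*\\]>.*?<!\\[endif\\]-->",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Namespace-prefixed tags, e.g. <o:p>, </o:p>, <v:shape ...>.
    // Only the tags are removed; any text between them is kept.
    private static final Pattern NAMESPACED_TAG = Pattern.compile(
            "</?[A-Za-z][A-Za-z0-9]*:[^>]*>");

    // Namespace declarations, e.g. xmlns:o="urn:schemas-microsoft-com:office:office"
    private static final Pattern XMLNS_ATTRIBUTE = Pattern.compile(
            "\\s+xmlns(:[A-Za-z0-9]+)?\\s*=\\s*(\"[^\"]*\"|'[^']*'|[^\\s>]+)");

    private OfficeHtmlCleaner() {
    }

    static String clean(String html) {
        String result = CONDITIONAL_COMMENT.matcher(html).replaceAll("");
        result = NAMESPACED_TAG.matcher(result).replaceAll("");
        result = XMLNS_ATTRIBUTE.matcher(result).replaceAll("");
        return result;
    }
}
```

Calling OfficeHtmlCleaner.clean(html) before handing the string to the HTML parser leaves the ordinary tags in place, so the usual text extraction can run on the result.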

-----Original Message-----
From: Gaston [mailto:gasi@artentis.com] 
Sent: 03 December 2005 13:49
To: java-user@lucene.apache.org
Subject: best html parser for html documents generated by microsoft products

Hello,


JTidy is a very good HTML parser, but it is not optimal for HTML pages
created with Microsoft Office products such as Word, because it returns
Microsoft-specific HTML tags instead of only text.
Or should I handle HTML pages whose source begins like this

"

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<link rel=File-List href="index-Dateien/filelist.xml">

"

as XML files, using an XML parser instead of an HTML parser?


I think it should be an HTML page because of

"<meta http-equiv=Content-Type content="text/html; charset=windows-1252">"
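
One way to test the XML idea is to check whether such markup is well-formed XML at all. The sketch below uses the JDK's standard JAXP DOM parser; the class name WellFormedCheck and the method isWellFormedXml are made up for this example. An unquoted attribute value such as http-equiv=Content-Type, or a meta element that is never closed, is enough to make an XML parser reject the input.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
final class WellFormedCheck {

    private WellFormedCheck() {
    }

    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input: not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If isWellFormedXml returns false for a page, an XML parser is not an option for it without first repairing the markup, which is the job a tolerant HTML parser already does.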

I would be grateful for any kind of help.



Greetings


Gaston



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

