lucene-java-user mailing list archives

From Jingkang Zhang <>
Subject The problem of using Cyber Neko HTML Parser parse HTML files
Date Fri, 18 Feb 2005 06:12:40 GMT
When I use the CyberNeko HTML Parser to parse HTML
files (created by Microsoft Word) that contain HTML
built-in entity references (for example, &nbsp;), the
node value may contain unknown characters.

For example, given this source HTML:
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt
18pt"><SPAN lang=EN-US style="mso-bidi-font-size:
10.5pt"><FONT face="Times New Roman"><FONT
size=3>-rw-r--r--<SPAN style="mso-spacerun:
yes">&nbsp;&nbsp;&nbsp; </SPAN>1 root<SPAN
style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp;
</SPAN>root<SPAN style="mso-spacerun:
</SPAN>50 Jan 21 16:12

After parsing the HTML, the text is:
-rw-r--r--牋?1 root牋牋 root牋牋牋牋牋 50 Jan 21 16:12
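
One possible explanation, offered as an assumption and not confirmed anywhere in this message: the parser resolves &nbsp; correctly to U+00A0, but the text is later written out as single bytes (0xA0, as in ISO-8859-1) and then displayed as GBK, where the byte pair 0xA0 0xA0 is one Chinese character. The sketch below reproduces that effect with plain JDK classes only. The class and method names are hypothetical, it does not use the parser at all, and it assumes the JDK ships the GBK charset.

```java
import java.nio.charset.Charset;

public class MojibakeDemo {
    // Encode text with one charset, then decode the same bytes with another,
    // imitating a writer and a reader that disagree about the encoding.
    static String reinterpret(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        // Four non-breaking spaces, as produced by "&nbsp;&nbsp;&nbsp;&nbsp;".
        String fourNbsp = "\u00A0\u00A0\u00A0\u00A0";
        // Assumption: bytes written as ISO-8859-1 but read back as GBK.
        String garbled = reinterpret(fourNbsp, "ISO-8859-1", "GBK");
        System.out.println(garbled);
    }
}
```

If this is what is happening, the parsed node values are fine and the mismatch is in how the text is written or viewed afterwards; using one encoding consistently on both sides (for example UTF-8) would be the thing to check first.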

How can I avoid it?
