lucene-java-user mailing list archives

From "Chandan Tamrakar" <chan...@ccnep.com.np>
Subject Re: Indexing HTML
Date Fri, 19 Mar 2004 10:26:10 GMT
How do I index an HTML document that may be in any encoding, such as
EUC, SJIS, Western European, or UTF-8? Can I parse the HTML, extract the
text into a string, and then convert it into a Unicode text file?
Is this an appropriate way to index HTML files? Can anyone suggest a
simple parser other than the one found in the Lucene demo?

Also, how do I find the encoding of a file? Whenever ANSI text files
contain Japanese characters, I am not able to convert them into UTF-16,
but Lucene indexes them properly when I convert them into SJIS.
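
To make the conversion step concrete, here is a minimal sketch of decoding
raw file bytes into a Java String. It assumes the charset name (for example
"Shift_JIS", "EUC-JP", or "UTF-8") is already known, so it does not solve
the detection question. The class and method names are hypothetical and are
not part of Lucene or its demo. Java strings are UTF-16 internally, so once
the bytes are decoded with the right charset, no separate conversion to a
Unicode text file should be needed before handing the text to the indexer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Lucene.
final class CharsetDecodeSketch {

    // Decode raw bytes with the named charset. The decoder is set to REPORT
    // so that a wrong charset guess throws instead of silently producing
    // replacement characters that would then be indexed.
    static String decode(byte[] raw, String charsetName)
            throws CharacterCodingException {
        return Charset.forName(charsetName)
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: CharsetDecodeSketch <file> <charset>");
            return;
        }
        // args[0] is the file to read, args[1] is the assumed charset name.
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(decode(raw, args[1]));
    }
}
```

Strict decoding only catches byte sequences that are invalid in the assumed
charset. A file can decode without error under the wrong charset and still
come out as nonsense, so this is a sanity check and not an encoding detector.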

Thanks,
chandan




