lucene-java-user mailing list archives

From "Chandan Tamrakar" <>
Subject Re: Indexing HTML
Date Fri, 19 Mar 2004 10:26:10 GMT
How do I index an HTML document that may be in any encoding, such as EUC,
SJIS, Western European, or UTF-8? Can I parse the HTML, extract its text
into a string, and then convert that into a Unicode text file?
Is this an appropriate way to index HTML files? Can anyone suggest a
simple parser other than the one found in the Lucene demo?
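
To make the idea concrete, here is a minimal sketch of the "decode, then
extract text" step in plain Java. It assumes the encoding of the file is
already known. The class and method names (HtmlDecodeSketch, decode,
stripTags) are hypothetical, and the regex-based tag stripping is only a
crude stand-in for a real HTML parser.

```java
import java.nio.charset.Charset;

class HtmlDecodeSketch {
    // Decode raw bytes with the charset the file was written in.
    // The result is an ordinary Java String (UTF-16 internally),
    // so no separate "convert to Unicode" step is needed afterwards.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Crude text extraction (an assumption for illustration only):
    // drop script/style blocks, then every remaining tag, then
    // collapse runs of whitespace. Entities are NOT decoded.
    static String stripTags(String html) {
        return html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                   .replaceAll("(?s)<[^>]+>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        // Sample bytes in ISO-8859-1; a Japanese file would use a
        // name such as "Shift_JIS" or "EUC-JP" if the JRE supports it.
        byte[] raw = "<html><body><p>caf\u00e9 au lait</p></body></html>"
                .getBytes(Charset.forName("ISO-8859-1"));
        System.out.println(stripTags(decode(raw, "ISO-8859-1")));
    }
}
```

The extracted string could then be handed to the indexer directly, without
writing an intermediate Unicode text file.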

Also, how do I find the encoding of a file? Whenever ANSI text files
contain Japanese characters, I am not able to convert them to UTF-16, but
Lucene indexes them properly when I convert them to SJIS.
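
For HTML files, one possible starting point is to look for a charset
declared in a meta tag. The sketch below is an assumption-laden
illustration, not a general detector: the names (CharsetSniffSketch,
declaredCharset) are hypothetical, it only works for ASCII-compatible
encodings such as EUC-JP, Shift_JIS, UTF-8, or Western European, and it
does nothing for plain text files, which carry no declaration at all.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffSketch {
    // Matches charset=NAME inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?i)<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:\\-]*)");

    // Returns the charset name declared in the first 4 KB of the file,
    // or the caller-supplied fallback if none is declared or supported.
    static String declaredCharset(byte[] raw, String fallback) {
        int n = Math.min(raw.length, 4096);
        // ISO-8859-1 maps each byte to exactly one char, so the ASCII
        // markup stays readable whatever the real encoding is.
        String head = new String(raw, 0, n, Charset.forName("ISO-8859-1"));
        Matcher m = META_CHARSET.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return m.group(1);
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head></html>")
                .getBytes(Charset.forName("ISO-8859-1"));
        System.out.println(declaredCharset(page, "ISO-8859-1"));
    }
}
```

The returned name can be passed to Charset.forName when decoding the
bytes. A declaration may be missing or wrong, so the fallback value is a
guess that the caller has to choose.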

