Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <005101c40d9c$9aa74810$2403a8c0@chandan>
From: "Chandan Tamrakar" <chandan@ccnep.com.np>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
References: <Pine.LNX.4.44.0403031554300.8532-100000@mere.cirano.qc.ca>
 <5.2.1.1.0.20040318141516.01e03d58@mail.novell.com>
Subject: Re: Indexing HTML
Date: Fri, 19 Mar 2004 16:11:10 +0545
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

How do I index an HTML document that may be in any of several encodings,
such as EUC, SJIS, Western European, or UTF-8? Can I parse the HTML, extract
the text into a string, and then write it out as a Unicode text file?
Is this an appropriate way to index HTML files? Can anyone suggest a
simple parser other than the one in the Lucene demo?
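
To make the question concrete, this is roughly the approach I have in mind.
It is only a minimal sketch, assuming the encoding name is already known.
The class name HtmlDecode and the stripTags helper are made up for
illustration, and the regex tag stripper is a crude stand-in for a real
HTML parser (it does not handle scripts, comments, or entities).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

class HtmlDecode {
    // Decode raw file bytes into a Java String (Unicode internally),
    // given the name of the source encoding, e.g. "Shift_JIS" or "UTF-8".
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Crude tag removal for illustration only: replaces anything between
    // '<' and '>' with a space, then collapses runs of whitespace.
    static String stripTags(String html) {
        return html.replaceAll("(?s)<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: HtmlDecode <file> <charset>");
            return;
        }
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        // The resulting String is what would be handed to the indexer.
        System.out.println(stripTags(decode(raw, args[1])));
    }
}
```

If this is right, the intermediate Unicode text file may not be needed at
all, since the decoded String could be passed on directly.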

Also, how do I find the encoding of a file? Whenever an ANSI text file
contains Japanese characters, I am not able to convert it into UTF-16.
However, Lucene indexes it properly when I convert it into SJIS.
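
One heuristic I could try for the encoding question is to decode the bytes
strictly with each candidate charset and keep the first one that reports no
errors. This is a sketch under assumptions, not a reliable detector: the
class name CharsetGuess and the candidate list are hypothetical, and it
cannot tell apart encodings that both accept the same bytes. A single-byte
charset such as ISO-8859-1 accepts any byte sequence, so it has to come
last in the list.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class CharsetGuess {
    // Returns the first candidate charset name that decodes the bytes
    // without a malformed-input or unmappable-character error, or null
    // if none of the candidates fits.
    static String guess(byte[] raw, String[] candidates) {
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue; // this JVM does not ship the charset
            }
            try {
                Charset.forName(name).newDecoder()
                       .onMalformedInput(CodingErrorAction.REPORT)
                       .onUnmappableCharacter(CodingErrorAction.REPORT)
                       .decode(ByteBuffer.wrap(raw));
                return name;
            } catch (CharacterCodingException e) {
                // Bytes are not valid in this charset; try the next one.
            }
        }
        return null;
    }
}
```

A call might look like guess(bytes, new String[] {"UTF-8", "Shift_JIS",
"EUC-JP", "ISO-8859-1"}), with the order of candidates being my own guess.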

Thanks,
Chandan

