Message-ID: <005101c40d9c$9aa74810$2403a8c0@chandan>
From: "Chandan Tamrakar"
To: "Lucene Users List"
References: <5.2.1.1.0.20040318141516.01e03d58@mail.novell.com>
Subject: Re: Indexing HTML
Date: Fri, 19 Mar 2004 16:11:10 +0545

How do I index an HTML document that may be in any encoding, such as EUC, SJIS, Western European, or UTF-8? Can I parse the HTML, extract the text into a string, and then convert it into a Unicode text file? Is this an appropriate way to index HTML files?
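To make the approach above concrete, here is a minimal sketch in Java of what I have in mind. It assumes the encoding name is already known and passed in by the caller, and it uses a crude regular expression in place of a real HTML parser. The class and method names (`HtmlToUnicode`, `decode`, `stripTags`) are made up for illustration and are not part of Lucene.

```java
import java.nio.charset.Charset;

public class HtmlToUnicode {

    // Decode raw bytes into a Java String (Unicode internally) using the
    // given charset name. Names such as "Shift_JIS" or "EUC-JP" can be
    // passed here if the JVM in use supports them.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Crude tag stripper (an assumption, not a real parser): replaces
    // anything between '<' and '>' with a space, then collapses whitespace.
    // It does not handle scripts, comments, or character entities.
    public static String stripTags(String html) {
        return html.replaceAll("(?s)<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        String charsetName = "UTF-8"; // assumed to be known in advance
        String html = "<html><body><p>\u65e5\u672c\u8a9e text</p></body></html>";

        // Simulate reading the file as bytes in its original encoding.
        byte[] raw = html.getBytes(Charset.forName(charsetName));

        // Bytes -> Unicode string -> plain text ready to hand to an indexer.
        String text = stripTags(decode(raw, charsetName));
        System.out.println(text);
    }
}
```

The resulting string is already Unicode inside Java, so writing an intermediate Unicode text file may not be needed if the text can be passed to the indexer directly.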
Can anyone suggest a simple parser other than the one found in the Lucene demo? Also, how do I find the encoding of a file? Whenever ANSI text files contain Japanese characters, I am not able to convert them into UTF-16. Lucene indexes them properly when I convert them into SJIS.

Thanks,
Chandan