From: Erik Hatcher
Subject: Re: demo IndexHTML parser breaks unicode?
Date: Sat, 25 Sep 2004 05:48:24 -0400
To: "Lucene Users List"

As for alternative HTML parsers, there are a few notable ones:

    NekoHTML - Nutch uses it
    JTidy - My Ant task in the sandbox uses it
    and HTMLParser

All of the above are surely far more battle-tested in production than
Lucene's demo parser, and I'd be surprised if they did not correctly
handle Unicode.

    Erik

On Sep 24, 2004, at 11:01 PM, Fred Toth wrote:

> Hi,
>
> Thanks for the tip, but that didn't work in my case. Presumably
> with this patch, and the changes in CVS, this makes the parser
> work with UTF-16. I can't really tell because the index appears
> now to be completely UTF-16 and I can't search for anything.
>
> My input is actually UTF-8 anyway, and if I patch all the streams
> to use UTF-8 instead of UTF-16, I get parser errors.
>
> So I'm stuck.
>
> Thanks for your help,
>
> Fred
>
> At 09:46 PM 9/24/2004, wallen@Cyveillance.com wrote:
>> In org.apache.lucene.demo.HTMLDocument you need to change the input stream
>> to use a different encoding. Replace the fis with this:
>>
>> fis = new InputStreamReader(new FileInputStream(f), "UTF-16");
>>
>> -----Original Message-----
>> From: Fred Toth [mailto:ftoth@synernet.com]
>> Sent: Friday, September 24, 2004 9:25 PM
>> To: Lucene Users List
>> Subject: Re: demo IndexHTML parser breaks unicode?
>>
>>
>> Sorry, that didn't cure it.
>>
>> Again, anyone want to point me to the quickest replacement
>> HTML parser (that's unicode clean)?
>>
>> Thanks,
>>
>> Fred
>>
>> At 03:17 PM 9/24/2004, you wrote:
>> >On Friday 24 September 2004 19:58, Fred Toth wrote:
>> >
>> > > I've got unicode in my source HTML. In particular, within meta tags,
>> > > and it's getting broken by the indexer. Note that I'm not trying to
>> > > query on any of this, just store and retrieve document titles with
>> > > unicode characters.
>> >
>> >Please try again with the code from CVS, Christoph Goller committed a fix
>> >for this problem (at least I think it was this problem) 1-3 weeks ago.
>> >
>> >Regards
>> > Daniel
>> >
>> >--
>> >http://www.danielnaber.de
>> >
>> >---------------------------------------------------------------------
>> >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> >For additional commands, e-mail: lucene-user-help@jakarta.apache.org
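
For readers following the thread: the encoding change suggested above (wrapping the FileInputStream in an InputStreamReader with an explicit charset) can be sketched for UTF-8 input, which is what Fred says his files actually are. This is only a minimal sketch, not the demo's HTMLDocument code. The class name Utf8HtmlReader and its methods are hypothetical, and it assumes the files really are UTF-8 encoded. It shows only the decoding step and does not address the parser errors Fred reports.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene or its demo package.
public class Utf8HtmlReader {

    /** Opens the file with an explicit UTF-8 decoder instead of the platform default. */
    public static Reader open(File f) throws IOException {
        return new BufferedReader(
            new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8));
    }

    /** Reads the whole file into a String, decoding it as UTF-8. */
    public static String readAll(File f) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = open(f)) {
            char[] buf = new char[4096];
            int n;
            // Copy decoded characters until end of stream.
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

A Reader opened this way can be handed to whichever HTML parser is in use. If the decoded text still looks wrong, the declared charset probably does not match the bytes on disk.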