lucene-java-user mailing list archives

From wal...@Cyveillance.com
Subject RE: demo IndexHTML parser breaks unicode?
Date Sat, 25 Sep 2004 01:46:23 GMT
In org.apache.lucene.demo.HTMLDocument you need to read the file through a
reader that uses an explicit encoding.  Replace the assignment to fis with this:

fis = new InputStreamReader(new FileInputStream(f), "UTF-16");
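
To see why the encoding argument matters, here is a minimal standalone sketch
(not part of the Lucene demo). The class ExplicitCharsetRead and the helper
readAll are hypothetical names made up for illustration. It writes a small
HTML snippet as UTF-16, then reads it back once with the matching encoding and
once with a mismatched one. "UTF-16" is simply the value from the line above;
the name has to match how your files are actually encoded (for example "UTF-8").

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

public class ExplicitCharsetRead {

    // Hypothetical helper: read a whole file as text, decoding its bytes
    // with the named charset instead of the platform default.
    public static String readAll(File f, String charsetName) throws IOException {
        Reader in = new InputStreamReader(new FileInputStream(f), charsetName);
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption for the demo: a temp file stands in for a real HTML source file.
        File f = File.createTempFile("title", ".html");
        f.deleteOnExit();

        String html = "<title>Caf\u00e9 \u65e5\u672c\u8a9e</title>";
        Writer out = new OutputStreamWriter(new FileOutputStream(f), "UTF-16");
        try {
            out.write(html);
        } finally {
            out.close();
        }

        // Matching encoding: the text round-trips unchanged.
        System.out.println(html.equals(readAll(f, "UTF-16")));
        // Mismatched encoding: the same bytes decode to different characters.
        System.out.println(html.equals(readAll(f, "ISO-8859-1")));
    }
}
```

If fis is declared as a FileInputStream in your copy of HTMLDocument, its
declared type would presumably have to become Reader (or InputStreamReader)
for the assignment above to compile.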

-----Original Message-----
From: Fred Toth [mailto:ftoth@synernet.com]
Sent: Friday, September 24, 2004 9:25 PM
To: Lucene Users List
Subject: Re: demo IndexHTML parser breaks unicode?


Sorry, that didn't cure it.

Again, anyone want to point me to the quickest replacement
HTML parser (that's unicode clean)?

Thanks,

Fred

At 03:17 PM 9/24/2004, you wrote:
>On Friday 24 September 2004 19:58, Fred Toth wrote:
>
> > I've got unicode in my source HTML. In particular, within meta tags,
> > and it's getting broken by the indexer. Note that I'm not trying to
> > query on any of this, just store and retrieve document titles with
> > unicode characters.
>
>Please try again with the code from CVS, Christoph Goller committed a fix
>for this problem (at least I think it was this problem) 1-3 weeks ago.
>
>Regards
>  Daniel
>
>--
>http://www.danielnaber.de
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org


