lucene-java-user mailing list archives

From Fred Toth <ft...@synernet.com>
Subject RE: demo IndexHTML parser breaks unicode?
Date Sat, 25 Sep 2004 03:01:49 GMT
Hi,

Thanks for the tip, but that didn't work in my case. Presumably
this patch, together with the changes in CVS, makes the parser
work with UTF-16. I can't really tell, because the index now
appears to be entirely UTF-16 and I can't search for anything.

My input is actually UTF-8 anyway, and if I patch all the streams
to use UTF-8 instead of UTF-16, I get parser errors.
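
To illustrate the mismatch described above, here is a minimal sketch of
what happens when UTF-8 bytes are read through a reader set to "UTF-16"
and through one set to "UTF-8". The class EncodingSketch, the helper
readAll and the sample title are made up for illustration; none of this
is code from the Lucene demo, and it does not touch the demo's HTML
parser, so it says nothing about the parser errors themselves.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    // Hypothetical helper: decode a whole stream with the named charset.
    public static String readAll(InputStream in, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(in, charsetName);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample title with non-ASCII characters, encoded as UTF-8 (assumed input).
        String title = "Caf\u00e9 M\u00fcnchen";
        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset recovers the original text.
        String ok = readAll(new ByteArrayInputStream(utf8), "UTF-8");
        // Decoding the same bytes as UTF-16 pairs them up into unrelated characters.
        String wrong = readAll(new ByteArrayInputStream(utf8), "UTF-16");

        System.out.println(ok.equals(title));    // prints true
        System.out.println(wrong.equals(title)); // prints false
    }
}
```

If the reader's charset does not match the bytes on disk, the text is
already garbled before any parser or indexer sees it, which would be
consistent with an index that looks like "completely UTF-16".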

So I'm stuck.

Thanks for your help,

Fred

At 09:46 PM 9/24/2004, wallen@Cyveillance.com wrote:
>In org.apache.lucene.demo.HTMLDocument you need to change the input stream
>to use a different encoding.  Replace the fis with this:
>
>fis = new InputStreamReader(new FileInputStream(f), "UTF-16");
>
>-----Original Message-----
>From: Fred Toth [mailto:ftoth@synernet.com]
>Sent: Friday, September 24, 2004 9:25 PM
>To: Lucene Users List
>Subject: Re: demo IndexHTML parser breaks unicode?
>
>
>Sorry, that didn't cure it.
>
>Again, anyone want to point me to the quickest replacement
>HTML parser (that's unicode clean)?
>
>Thanks,
>
>Fred
>
>At 03:17 PM 9/24/2004, you wrote:
> >On Friday 24 September 2004 19:58, Fred Toth wrote:
> >
> > > I've got unicode in my source HTML. In particular, within meta tags,
> > > and it's getting broken by the indexer. Note that I'm not trying to
> > > query on any of this, just store and retrieve document titles with
> > > unicode characters.
> >
> >Please try again with the code from CVS, Christoph Goller committed a fix
> >for this problem (at least I think it was this problem) 1-3 weeks ago.
> >
> >Regards
> >  Daniel
> >
> >--
> >http://www.danielnaber.de
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >For additional commands, e-mail: lucene-user-help@jakarta.apache.org


