lucene-java-user mailing list archives

From Fred Toth <ft...@synernet.com>
Subject demo IndexHTML parser breaks unicode?
Date Fri, 24 Sep 2004 17:58:47 GMT
Hi,

I was hoping it wouldn't come to this:

I've got Unicode in my source HTML, in particular within meta tags,
and it's getting broken by the indexer. Note that I'm not trying to
query on any of this, just store and retrieve document titles that
contain Unicode characters.

Has anyone else experienced this? I know this is just a demo, but
it's been working really well and I hate to give it up!

Is this easily fixable? I'm a little worried by this comment in
SimpleCharStream.java:

/**
  * An implementation of interface CharStream, where the stream is assumed to
  * contain only ASCII characters (without unicode processing).
  */

This is likely a show-stopper for me on this parser.
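
To illustrate the kind of breakage I mean, here is a minimal sketch. It
assumes the source bytes are UTF-8 and get decoded with a single-byte
charset somewhere along the way; that is only a guess at the cause, not
something I have confirmed in the demo code. The class name CharsetSketch
and the helper decode() are made up for this example and are not part of
Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSketch {
    // Hypothetical helper: decode raw bytes into a String with an explicit charset,
    // instead of relying on whatever default the reading code happens to use.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A title with one non-ASCII character (e with acute accent).
        String title = "caf\u00e9";
        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips cleanly.
        String good = decode(utf8, StandardCharsets.UTF_8);
        // Assumed failure mode: the same bytes read as ISO-8859-1 turn the one
        // accented character into two unrelated characters.
        String bad = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip intact: " + good.equals(title));
        System.out.println("ISO-8859-1 decode intact: " + bad.equals(title));
        System.out.println("Lengths: " + title.length() + " vs " + bad.length());
    }
}
```

If that guess is right, handing the parser a Reader built with the correct
charset might be enough, but I have not verified that the demo parser
preserves non-ASCII characters even then.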

Can anyone recommend the shortest path to another HTML parser
that is Unicode-friendly?

Thanks for anything.

Fred



