lucene-java-user mailing list archives

From "Dominator" <galf...@x-cago.com>
Subject HTMLParser or unicode problem???
Date Mon, 30 Sep 2002 07:18:13 GMT
Hi,
I want to add support for Danish to Lucene's HTMLParser.
What I have done so far:

1) In HTMLParser.jj I have added the Danish letters to the LET token: < #LET:
["A"-"Z","a"-"z","0"-"9","æ","å","Ø","ø","Å","Æ"] >
2) In StandardTokenizer.jj I have added "\u0080"-"\u00ff",
"\u0100"-"\u017f", "\u0180"-"\u024f", "\u00c0"-"\u00d6" to the "#LETTER:"
token
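
As a sanity check on step 2, the following minimal Java sketch tests whether the six Danish letters fall inside the ranges added to "#LETTER:". The class name LetterRangeCheck and its method are hypothetical helpers for illustration only; they are not part of Lucene or JavaCC.

```java
// Hypothetical helper, not part of Lucene: checks a character against the
// ranges that were added to the #LETTER token in step 2.
class LetterRangeCheck {
    // Each pair is an inclusive {low, high} range copied from the message.
    static final char[][] ADDED_RANGES = {
        {'\u0080', '\u00ff'},
        {'\u0100', '\u017f'},
        {'\u0180', '\u024f'},
        {'\u00c0', '\u00d6'}   // already contained in the first range
    };

    static boolean covered(char c) {
        for (char[] range : ADDED_RANGES) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // æ ø å Æ Ø Å written as escapes so the source file encoding does not matter.
        char[] danish = {'\u00e6', '\u00f8', '\u00e5', '\u00c6', '\u00d8', '\u00c5'};
        for (char c : danish) {
            System.out.println(Integer.toHexString(c) + " covered: " + covered(c));
        }
    }
}
```

All six letters lie between \u00c0 and \u00ff, so the first range alone covers them; the fourth range is redundant.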


When I search (after compiling and indexing) for words that contain these
special characters, the search engine can't find them. For example, when I
search for "civilingeniør", I get no results. When I print the comment
string of a result (found by searching for a nearby word), I see
this in my browser: "civilingeniør". So somehow the parser is mapping the
characters incorrectly. What am I doing wrong?
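
One possible cause, offered only as an assumption since nothing in this message confirms it, is that the HTML bytes are decoded with a different charset than the one they were written in. The Java sketch below shows the effect in isolation: the same bytes give different strings depending on the charset passed to InputStreamReader. The class CharsetDemo and its decode method are hypothetical and not part of Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes bytes through a Reader with an explicit charset.
class CharsetDemo {
    static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String word = "civilingeni\u00f8r";              // "civilingeniør"
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(word));

        // Decoding the same bytes as ISO-8859-1 turns the two UTF-8 bytes of
        // 'ø' (0xC3 0xB8) into two separate characters, so the term no longer matches.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(word));
    }
}
```

If a mismatch like this occurs at indexing time, the indexed term differs from the query term, which would be consistent with the empty result described above. Passing the file's actual charset to the Reader that feeds the parser is one way to rule this out.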


Thx..





