lucene-java-user mailing list archives

Subject Indexing and searching non-latin languages using utf-8
Date Tue, 18 Mar 2003 16:35:56 GMT
Hi all,

I have a problem indexing and then searching documents written in non-Latin
languages and encoded in UTF-8 (Russian, for example).

I have a web application with a simple form for searching the contents of
the documents.
When I submit the form, I encode the query term as UTF-8 with
encodeURI(String), but no document matches. I think this is caused by faulty
indexing, but I'm not sure.
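
To rule out the query side, here is a minimal sketch of how the percent-encoded
term would be decoded on the server. It uses only java.net.URLDecoder and
java.net.URLEncoder; the class name QueryDecodingSketch and its helper methods
are hypothetical, and the ISO-8859-1 case is only an assumed example of a
mismatched decoder, not a confirmed cause.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
class QueryDecodingSketch {

    // Decode a percent-encoded query value using the given charset.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Percent-encode a value as UTF-8 bytes, similar in spirit to what
    // encodeURI does for non-ASCII characters in the browser.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A six-letter Russian word written with escapes, so the source
        // file's own encoding does not matter.
        String term = "\u043f\u0440\u0438\u0432\u0435\u0442";
        String encoded = encodeUtf8(term);

        // Decoding with the charset used for encoding restores the term.
        String good = decode(encoded, "UTF-8");
        // Decoding with a single-byte charset turns each UTF-8 byte into
        // its own character, so the result no longer matches the term.
        String bad = decode(encoded, "ISO-8859-1");

        System.out.println(encoded);
        System.out.println(good.equals(term));
        System.out.println(bad.length());
    }
}
```

If the decoded term differs from the original at this point, the mismatch is
in the request handling rather than in the index.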

As far as I know, Lucene indexes documents by writing Terms to the 'xxx.tis'
file, and I believe it encodes them in UTF-8.
When the file is read, the Russian characters (2 bytes each) come through
correctly, but once they are written to the index they look different (I
listed the terms in my application console).
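
To check the reading side, here is a minimal sketch of decoding UTF-8 bytes
with an explicit charset before the text is handed to the indexer. It uses only
java.io classes; the class name Utf8ReadSketch and its helpers are
hypothetical, and the byte array stands in for the contents of a real document.
The escape helper prints code points as \uXXXX because the console itself may
not be able to display Russian characters, which could make correct terms look
wrong.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper class, for illustration only.
class Utf8ReadSketch {

    // Decode raw bytes into a String using an explicitly named charset,
    // instead of relying on the platform default.
    static String readAll(byte[] data, String charsetName) {
        try {
            Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(data), charsetName);
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Render every char as \\uXXXX so the output does not depend on the
    // console's encoding.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF-8 bytes of a six-letter Russian word (2 bytes per letter).
        byte[] utf8 = {
            (byte) 0xD0, (byte) 0xBF, (byte) 0xD1, (byte) 0x80,
            (byte) 0xD0, (byte) 0xB8, (byte) 0xD0, (byte) 0xB2,
            (byte) 0xD0, (byte) 0xB5, (byte) 0xD1, (byte) 0x82
        };
        // Six Cyrillic code points when decoded as UTF-8.
        System.out.println(escape(readAll(utf8, "UTF-8")));
        // Twelve unrelated characters when decoded with a single-byte charset.
        System.out.println(escape(readAll(utf8, "ISO-8859-1")));
    }
}
```

Comparing the escaped output before indexing with the escaped terms listed
from the index should show at which step the characters change.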

If anyone has a solution to my problem, any advice is welcome.

