lucene-java-user mailing list archives

From Paul Libbrecht <p...@activemath.org>
Subject Re: UTF-8 indexing and searching
Date Fri, 01 Jul 2005 20:57:59 GMT
Careful: in the HTTP world, there's an ambiguity.
application/x-www-form-urlencoded does not specify the character
encoding in which the bytes represented by the %-escaped sequences are
written.
That's fixed by the very recent URI spec, where absence means UTF-8...
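
To make the ambiguity concrete, here is a minimal sketch that decodes the
same %-escaped string under two different charset assumptions. The class
name and the sample string are hypothetical, chosen only for illustration;
URLDecoder.decode(String, String) is the standard JDK call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class: one escaped string, two possible readings.
final class EscapeAmbiguityDemo {

    // Decodes %-escapes, interpreting the resulting bytes in the given charset.
    static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // %C3%A9 is the UTF-8 byte sequence for U+00E9 (e with acute accent).
        String escaped = "caf%C3%A9";

        // Read as UTF-8: the two bytes form one character (4 chars in total).
        System.out.println(decode(escaped, "UTF-8").length());

        // Read as ISO-8859-1: each byte becomes its own character (5 chars).
        System.out.println(decode(escaped, "ISO-8859-1").length());
    }
}
```

Nothing in the escaped string itself says which reading is intended, so
the receiver has to know or guess the charset.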

My experience was that Tomcat simply turned each byte into the low
byte of a 16-bit Unicode char, which amounts to decoding the input as
ISO-8859-1.
We succeeded in receiving forms from UTF-8-encoded pages by wrapping a
UTF-8 InputStreamReader around an InputStream that reads the
bytes of the string returned by request.getParam...
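
A minimal sketch of that workaround, assuming the container has already
decoded the parameter as ISO-8859-1. The class and method names are
hypothetical, and the method takes a plain String rather than a servlet
request so that it stays self-contained; using ByteArrayInputStream as the
underlying stream is also an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-reads a mis-decoded parameter value as UTF-8.
final class ParamRecoder {

    static String recodeAsUtf8(String misdecoded) {
        // Recover the original bytes: ISO-8859-1 maps each char 0..255
        // back to the single byte it was decoded from.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);

        // Wrap a UTF-8 reader around a stream over those bytes.
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(raw), StandardCharsets.UTF_8)) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same result can be had with new String(raw, StandardCharsets.UTF_8).
Either way, this only helps when the value really was decoded as
ISO-8859-1; applied to a correctly decoded string it would corrupt any
non-ASCII characters.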

Hope that helps.

paul



On 1 Jul 2005, at 22:41, <pierre.conti@vtdim.com> wrote:

>
> Did you check that the request string you get at the analyzer
> level is correctly encoded as UTF-8?
> We had the same problem with French accented characters encoded
> as UTF-8 and transmitted by Tomcat as ISO-8859-1. It was
> just a test, so we didn't investigate much, but we
> re-encoded the string as URL/ISO-8859-1 and re-decoded it from URL
> as UTF-8, and it worked.
> Don't know if it may help you ...
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



