lucene-solr-dev mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: resin and UTF-8 in URLs
Date Thu, 01 Feb 2007 23:18:41 GMT

: > Solr, in my opinion, shouldn't have the string "UTF-8" hardcoded in it
: > anywhere -- not even in the example config: new users shouldn't need to
: > know about or have any special solrconfig options that must be (un)set to get
: > Solr to use their servlet container / system default charset.
:
: I strongly disagree. When we use standards like URIs and XML which
: specify UTF-8, we should use UTF-8.

I'm confused:  As far as URI/URLs go, Solr isn't the one decoding them,
and as I said: nothing in the servlet spec suggests that an app has any
say over how the servlet container will decode them, presumably because
they *must* be UTF-8 ... so this is not our problem, and we shouldn't go
out of our way to try and force the servlet container to treat the URLs as
UTF-8.

As for XML, or any other format a user might POST to solr (or ask solr
to fetch from a remote source) what possible reason would we have for only
supporting UTF-8? .. why do you suggest that the XML standard "specify
UTF-8, [so] we should use UTF-8" ... doesn't the XML standard say we
should use the charset specified in the content-type if there is one, and
if not use the encoding specified in the xml header, ie...

	<?xml version='1.0' encoding='EUC-JP'?>

...the only real question in my mind is what to do if user supplied data
has *NO* charset information of any kind ... for XML the spec seems very
clear that in that case you test for UTF-8 or UTF-16 ... but for arbitrary
streams of character data in other formats (CSV, JSON, etc...) it seems
like trusting the servlet container to tell us the default encoding is the
right way to go.
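
To make that precedence concrete, here is a minimal Java sketch. It is not
anything that exists in Solr: the class name CharsetSketch, the resolve
method, and all of its parameters are hypothetical. It assumes the caller
hands over the first bytes of the stream plus whatever default charset the
servlet container reports, and it only recognizes UTF-8/UTF-16 by byte
order mark, which is a simplification of what the XML spec describes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Solr.
final class CharsetSketch {
    // charset=... parameter of a Content-Type value, e.g. "text/xml; charset=EUC-JP"
    private static final Pattern CT_CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);
    // encoding='...' inside an XML declaration at the very start of the stream
    private static final Pattern XML_ENCODING =
        Pattern.compile("^<\\?xml[^>]*\\bencoding\\s*=\\s*['\"]([A-Za-z][A-Za-z0-9._-]*)['\"]");

    // head: the first bytes of the stream; containerDefault: assumed to come from the container.
    public static Charset resolve(String contentType, byte[] head,
                                  boolean isXml, Charset containerDefault) {
        // 1. An explicit charset in the Content-Type wins.
        if (contentType != null) {
            Matcher m = CT_CHARSET.matcher(contentType);
            Charset cs = m.find() ? lookup(m.group(1)) : null;
            if (cs != null) return cs;
        }
        if (!isXml) {
            // CSV, JSON, etc. with no charset information: trust the container.
            return containerDefault;
        }
        // 2. Otherwise use the encoding named in the XML declaration.
        //    ISO-8859-1 maps each byte to one char, so an ASCII declaration survives intact.
        Matcher m = XML_ENCODING.matcher(new String(head, StandardCharsets.ISO_8859_1));
        Charset declared = m.find() ? lookup(m.group(1)) : null;
        if (declared != null) return declared;
        // 3. No charset information at all: UTF-16 if a UTF-16 byte order mark is present,
        //    otherwise UTF-8 (BOM-only detection is a simplification).
        if (startsWith(head, 0xFE, 0xFF) || startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16;
        }
        return StandardCharsets.UTF_8;
    }

    // Returns null for unknown or malformed charset names instead of throwing.
    private static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The point of the last parameter is that nothing here hardcodes a fallback
for non-XML data: the caller supplies whatever the container says.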



-Hoss

