lucene-solr-dev mailing list archives

From Walter Underwood <wunderw...@netflix.com>
Subject Re: resin and UTF-8 in URLs
Date Fri, 02 Feb 2007 00:28:43 GMT
On 2/1/07 3:18 PM, "Chris Hostetter" <hossman_lucene@fucit.org> wrote:
>
> As for XML, or any other format a user might POST to solr (or ask solr
> to fetch from a remote source) what possible reason would we have to only
> supporting UTF-8? .. why do you suggest that the XML standard "specify
> UTF-8, [so] we should use UTF-8" ... doesn't the XML standard say we
> should use the charset specified in the content-type if there is one, and
> if not use the encoding specified in the xml header, ie...
> 
> <?xml encoding='EUC-JP'?>

The XML spec only requires parsers to support UTF-8 and UTF-16
(US-ASCII comes along for free as a subset of UTF-8). If you use
a different encoding for XML, even ISO 8859-1, there is no
guarantee that a conforming parser will accept it.
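As an illustration (using Python's xml.etree here, though any conforming parser should behave the same), one document parses identically in the two encodings every parser is required to accept:

```python
import xml.etree.ElementTree as ET

# The same document, serialized in the two encodings that every
# conforming XML parser must accept.
doc = '<?xml version="1.0" encoding="{enc}"?><doc>caf\u00e9</doc>'
for enc in ('UTF-8', 'UTF-16'):
    data = doc.format(enc=enc).encode(enc.lower())
    assert ET.fromstring(data).text == 'café'
```

Outside those two, you are relying on whatever extra encodings your particular parser happens to implement.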

Ultraseek has been indexing XML for the past nine years, and in
all that time I can remember exactly one customer who had XML in
a non-standard encoding. Effectively all real-world XML is in one
of the standard encodings.

The right spec for XML over HTTP is RFC 3023. For text/xml
with no charset spec, the XML must be interpreted as US-ASCII.
From section 8.5:

   Omitting the charset parameter is NOT RECOMMENDED for text/xml.  For
   example, even if the contents of the XML MIME entity are UTF-16 or
   UTF-8, or the XML MIME entity has an explicit encoding declaration,
   XML and MIME processors MUST assume the charset is "us-ascii".
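A minimal sketch of that determination (Python; `xml_charset` is a hypothetical helper for illustration, not Solr code), assuming the caller passes in the raw Content-Type header value and any encoding found in the XML declaration:

```python
def xml_charset(content_type, xml_decl_encoding=None):
    """Pick the charset for an XML entity per RFC 3023 (sketch).

    - Any */xml with an explicit charset parameter: use it; it wins
      over the XML declaration.
    - text/xml with no charset parameter: us-ascii, regardless of
      the XML declaration (RFC 3023 section 8.5).
    - application/xml with no charset parameter: fall back to the
      XML declaration, else UTF-8 (the XML default).
    """
    mime, _, params = content_type.partition(';')
    mime = mime.strip().lower()
    charset = None
    for param in params.split(';'):
        key, _, value = param.partition('=')
        if key.strip().lower() == 'charset':
            charset = value.strip().strip('"').lower()
    if charset:
        return charset
    if mime == 'text/xml':
        return 'us-ascii'
    return (xml_decl_encoding or 'utf-8').lower()
```

So `xml_charset('text/xml')` yields `us-ascii` even if the body declares `encoding='EUC-JP'`, which is exactly why omitting the charset parameter on text/xml is a trap.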

wunder


