From: Walter Underwood
Reply-To: solr-dev@lucene.apache.org
Date: Thu, 01 Feb 2007 16:28:43 -0800
Subject: Re: resin and UTF-8 in URLs

On 2/1/07 3:18 PM, "Chris Hostetter" wrote:

> As for XML, or any other format a user might POST to solr (or ask solr
> to fetch from a remote source) what possible reason would we have to only
> supporting UTF-8? .. why do you suggest that the XML standard "specify
> UTF-8, [so] we should use UTF-8" ... doesn't the XML standard say we
> should use the charset specified in the content-type if there is one, and
> if not use the encoding specified in the xml header, ie...
>

The XML spec requires XML parsers to support only UTF-8 and UTF-16.
ISO 8859-1 and US-ASCII are almost universally supported in practice.
If you use an encoding outside those four for XML, there is no guarantee
that a conforming parser will accept it.

Ultraseek has been indexing XML for the past nine years, and I remember
only a single customer that had XML in a non-standard encoding.
Effectively all real-world XML is in one of the standard encodings.

The right spec for XML over HTTP is RFC 3023. For text/xml with no
charset parameter, the XML must be interpreted as US-ASCII. From
section 8.5:

   Omitting the charset parameter is NOT RECOMMENDED for text/xml. For
   example, even if the contents of the XML MIME entity are UTF-16 or
   UTF-8, or the XML MIME entity has an explicit encoding declaration,
   XML and MIME processors MUST assume the charset is "us-ascii".

wunder
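
A rough sketch of the RFC 3023 rule discussed above, in Python. The function name `effective_xml_charset` is hypothetical and is not anything in Solr or Resin. It assumes only text/xml and application/xml need handling, and it returns None when the header gives no answer and the encoding has to come from the XML itself (BOM or encoding declaration). RFC 2231-encoded charset parameters are not handled.

```python
from email.message import Message
from typing import Optional


def effective_xml_charset(content_type: str) -> Optional[str]:
    """Return the charset the Content-Type header dictates for an XML body.

    Hypothetical helper illustrating RFC 3023:
      * an explicit charset parameter always wins;
      * text/xml without a charset parameter means us-ascii;
      * otherwise (e.g. application/xml without charset) return None,
        meaning the XML's own BOM / encoding declaration decides.
    """
    # Reuse the standard library's MIME header parser.
    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_param("charset")  # None when the parameter is absent
    if isinstance(charset, str):
        return charset.lower()

    if msg.get_content_type() == "text/xml":
        # RFC 3023 section 8.5: processors MUST assume us-ascii.
        return "us-ascii"

    return None


if __name__ == "__main__":
    for header in ("text/xml",
                   'text/xml; charset="UTF-8"',
                   "application/xml"):
        print(header, "->", effective_xml_charset(header))
```

The practical consequence is that a client posting UTF-8 XML as text/xml should always send an explicit charset parameter, since the encoding declaration inside the document is ignored in that case.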