Date: Tue, 7 Jul 2009 00:37:51 -0700
Message-ID: <31cc37360907070037v11d983afm1596f92b96ca4075@mail.gmail.com>
Subject: Re: [LANG] Wanted - spec lawyer.
From: Henri Yandell
To: Commons Developers List

On Tue, Jun 30, 2009 at 7:16 AM, John Bollinger wrote:
>
> Jörg Schaible wrote:
>> As pointed out http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets and
>> http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets define the valid
>> characters for XML 1.0 and 1.1.
>>
>> However, the escape functionality is actually different. If you transport
>> XML (or HTML) in a UTF-8 encoded text file or one encoded by ASCII-7 is a
>> big difference. In the former you don't have to encode anything, while you
>> have to encode anything above 0x7f in the latter case. And this applies to
>> XML, HTML or Java source files at equal level.
>>
>> The character set definition of the two XML versions is a vertical condition
>> set. An attempt to encode a character outside the XML definition is
>> actually a situation that cannot be handled and should raise an exception
>> (like every XML parser will do anyway).
>>
>> Therefore the question is, whether (Un)EscapeUtils should actually be an
>> instance initialized with the target character encoding. And that raises
>> the question how close we're actually at reimplementing
>> java.nio.Charset.encode.
>
> As I understand it, the basic idea of StringEscapeUtils.escapeXml() is to convert arbitrary character data from memory (a String) into a character sequence that has the same meaning when it appears literally in XML character data. This is a conversion from character data to character data, so character encoding is not directly relevant for this use (and this is a fundamental difference from Charset.encode()). The characters that must be escaped for this purpose are well defined by the XML specifications.
>
> The appearance of an encoding attribute in the xml declaration notwithstanding, the character encoding of an XML document is a property of a representation of the document, not a property of the document itself. There is therefore a *separate*, albeit related, consideration of escaping characters that cannot be expressed in a particular character encoding, so as to be able to encode the document to a byte sequence without data loss. This is a useful thing to do, and it is compatible with the main objective, but I think it would be well to avoid conflating the two as an indivisible task. They can be performed in one pass by one method, but they are logically distinct behaviors.
>
> If StringEscapeUtils wants to support the second use, then it needs a way for the user to tell it which additional characters to escape. One possibility would be to pass it a Charset which the user intends to apply (later) to encode the characters. StringEscapeUtils could then escape those input characters for which Charset.canEncode() returns false.
>
> Yet another separate question has arisen as to how to handle input characters which cannot appear in any way in a well formed XML (1.0 / 1.1) document, even as character references (e.g. U+0000). I'm not so certain that StringEscapeUtils needs to be concerned about that, and it would simplify things immensely if it considered that out of scope.
> Among other effects, I believe that would moot the distinction between XML 1.0 and XML 1.1 (and future versions) for this class. In addition, I strongly suspect that there are multiple production applications that (mis)use XML in a way that would be broken if character references to characters outside the XML character set were flagged as application errors; it would be considerate for StringEscapeUtils to be compatible with such (mis)use.
>

Thanks Jörg and John.

I agree with John: I don't think charsets are a blocker here, or an API feature. I think it's a use case to be aware of though, and it might hint at the differing requests.

The general aim, I think, should be to get a default behaviour that blends spec-right with what-I-expect-right. The framework approach then lets users who would prefer it another way easily put their own methods together. We could even include the Exception throwing as its own translator but not make it a default.

So... starting with XML. The simplest claim is that the following should be escaped:

& < > ' "

Currently we do that and we also escape anything above 0x7f. We don't escape any control characters (for example, those under 0x20).

Is there any reason to complicate things further? Can we keep escapeXml on the expected 5 characters, and let users who want to escape more add in more translators or write their own?

On the subject of more translators.... There's the existing > 0x7f NumericEntityEscaper. I suspect it might be worth defining that as a 'constant'. It also seems worth defining a NumericEntityEscaper (or some such) for the values less than 0x20, except the special ones such as newline.

Are there particular Exception ones that would be worth defining? An ExceptionTranslator that you can put on the front of the chain to error if any illegal chars are found? Maybe a few of these to match the various bits of BNF in the spec?
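
To make the chain idea concrete, here is a rough sketch in plain Java with no Commons Lang types. The Translator interface, the constant names and the escape() helper are all made up for illustration and are not the Lang API; the legal-character check assumes the Char production from the XML 1.0 spec. The default touches only the five characters, and everything else is opt-in.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: Translator and every name below are hypothetical,
// not the Commons Lang API.
class EscapeChainSketch {

    /** Returns a replacement for the code point, or null to decline. */
    interface Translator {
        String translate(int codePoint);
    }

    /** The five characters proposed as the escapeXml default. */
    static final Translator BASIC_XML = cp -> {
        switch (cp) {
            case '&':  return "&amp;";
            case '<':  return "&lt;";
            case '>':  return "&gt;";
            case '\'': return "&apos;";
            case '"':  return "&quot;";
            default:   return null;
        }
    };

    /** Opt-in: decimal character reference for anything above 0x7f. */
    static final Translator ABOVE_ASCII =
        cp -> cp > 0x7f ? "&#" + cp + ";" : null;

    /**
     * Opt-in: character reference for values below 0x20, except tab, LF, CR.
     * Note that XML 1.0 does not allow such references; XML 1.1 does
     * (apart from U+0000).
     */
    static final Translator CONTROL_CHARS = cp ->
        (cp < 0x20 && cp != '\t' && cp != '\n' && cp != '\r')
            ? "&#" + cp + ";" : null;

    /** Opt-in, front of chain: throw on code points outside XML 1.0 Char. */
    static final Translator REJECT_ILLEGAL_XML10 = cp -> {
        boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
        if (!legal) {
            throw new IllegalArgumentException(
                "Not a legal XML 1.0 character: U+" + Integer.toHexString(cp));
        }
        return null; // legal: let the rest of the chain decide
    };

    /** Walks the input by code point; the first translator to answer wins. */
    static String escape(String input, List<Translator> chain) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            String replacement = null;
            for (Translator t : chain) {
                replacement = t.translate(cp);
                if (replacement != null) {
                    break;
                }
            }
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Default behaviour: the five characters and nothing else. */
    static String escapeXml(String input) {
        return escape(input, Arrays.asList(BASIC_XML));
    }

    public static void main(String[] args) {
        String sample = "a < b & \"\u00e9\"";
        System.out.println(escapeXml(sample));
        // A stricter chain: check legality first, then escape more.
        System.out.println(escape(sample,
            Arrays.asList(REJECT_ILLEGAL_XML10, BASIC_XML, ABOVE_ASCII)));
    }
}
```

Keeping the exception check as just another translator means nobody pays for it unless they put it on the front of their own chain.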

Hen