From: Guillaume Laforge <glaforge@gmail.com>
To: users@groovy.incubator.apache.org
Reply-To: users@groovy.incubator.apache.org
Date: Tue, 9 Jun 2015 09:18:53 +0200
Subject: Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
Mailing-List: contact users-help@groovy.incubator.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a113d304221f8140518109225

--001a113d304221f8140518109225
Content-Type: text/plain; charset=UTF-8

For that point, perhaps it's a limitation of Java itself not recognizing
that alias?

2015-06-08 23:41 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:

> Another point of interest is that the current code doesn't respect
> aliases. For example, the charset string "UTF_16LE" will not write the
> BOM, despite being an alias for "UTF-16LE".
>
> -Keegan
> On Jun 8, 2015 5:20 PM, "Keegan Witt" <keeganwitt@gmail.com> wrote:
>
>> The code as-is today writes the BOM regardless of platform. I just
>> tested in Linux with the same results. I think there are 2 parts to
>> the question of "what's the correct behavior?"
>>
>> 1. Should the BOM be written at all, particularly when the platform
>>    is Windows?
>> 2. Should the behavior of *withPrintWriter* differ (even if the
>>    difference is to be smarter) from the behavior of *new PrintWriter*?
>>
>> *Discussion*
>> 1. Strictly speaking, yes, because RFC 2781 states in section 4.3 to
>> assume big endian if there is no BOM. However, in practice, many
>> applications disregard the RFC and assume little-endian because that's
>> what Windows does. Because of this, the behavior could be changed so
>> that when writing UTF-16LE on Windows, it doesn't write the BOM. But
>> in my opinion, it's best practice to always write a BOM when working
>> with UTF-16, and Java should have done this in their implementation of
>> their PrintWriter.
>>
>> 2. This is a tough one. Arguably, *withPrintWriter* is doing the
>> smarter, more correct behavior, but the typical user would assume this
>> is just a shorthand convenience for newing up a PrintWriter (I
>> certainly did). So the question is, is it better to just document this
>> difference in the GroovyDoc? Or to change the behavior to be closer to
>> Java? And if the latter, what breakages would that cause within Groovy
>> itself? Making that change could break folks in production, because
>> they could rely on that BOM being there, for example where the file is
>> created on Windows but then processed on Linux, or when working with a
>> third-party library that is more picky about the presence of a BOM.
>>
>> -Keegan
>>
>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <glaforge@gmail.com>
>> wrote:
>>
>>> Now... whether that is what should be done is the right question to
>>> ask :-)
>>> Does Windows manage to open UTF-16 files without BOMs?
>>>
>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>
>>>> I forgot to mention that. Yes, I ran the test mentioned in Windows.
>>>>
>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge
>>>> <glaforge@gmail.com> wrote:
>>>>
>>>>> That's a good question.
>>>>> I guess this is happening on Windows? (I haven't tried here, since
>>>>> I'm on OS X)
>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>
>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>
>>>>>> I've always taken a perverse pleasure in character encoding
>>>>>> problems. I was intrigued by this SO question on UTF-16 BOMs in
>>>>>> Java vs Groovy.
>>>>>>
>>>>>> It appears using withPrintWriter(charset) produces a BOM whereas
>>>>>> new PrintWriter(file, charset) does not, as demonstrated here:
>>>>>>
>>>>>> File file = new File("tmp.txt")
>>>>>> try {
>>>>>>     String text = " "
>>>>>>     String charset = "UTF-16LE"
>>>>>>
>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>     println "withPrintWriter"
>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>
>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>     w.print(text)
>>>>>>     w.close()
>>>>>>     println "\n\nnew PrintWriter"
>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>> } finally {
>>>>>>     file.delete()
>>>>>> }
>>>>>>
>>>>>> Outputs
>>>>>>
>>>>>> withPrintWriter
>>>>>> ff fe 20 00
>>>>>>
>>>>>> new PrintWriter
>>>>>> 20 00
>>>>>>
>>>>>> Is this difference in behavior intentional? It seems kinda odd to
>>>>>> me.
>>>>>>
>>>>>> -Keegan

--
Guillaume Laforge
Groovy Project Manager
Product Ninja & Advocate at Restlet

Blog: http://glaforge.appspot.com/
Social: @glaforge / Google+
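
One way to look at the alias question on the Java side alone is a small check like the one below. It is only a sketch under stated assumptions: the class `Utf16BomCheck` and its helpers `canonicalName`, `write` and `hex` are made-up names for illustration, and it exercises `java.nio.charset.Charset` and `java.io.PrintWriter` only, not Groovy's `withPrintWriter`. It asks Java which canonical charset the name "UTF_16LE" resolves to, dumps the bytes `new PrintWriter(file, charset)` produces, and shows one way to emit the BOM by hand (printing U+FEFF first) for anyone who wants it regardless of the writer used.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;

// Hypothetical helper class, not part of Groovy or the JDK.
class Utf16BomCheck {

    // Canonical name Java resolves a charset name or alias to.
    static String canonicalName(String charsetName) {
        return Charset.forName(charsetName).name();
    }

    // Writes text to a temp file through java.io.PrintWriter and returns
    // the raw bytes. If prependBom is true, U+FEFF is printed first so the
    // encoder itself turns it into the BOM bytes for that charset.
    static byte[] write(String text, String charsetName, boolean prependBom) {
        try {
            File file = File.createTempFile("bom-check", ".txt");
            try {
                try (PrintWriter w = new PrintWriter(file, charsetName)) {
                    if (prependBom) {
                        w.print('\uFEFF');
                    }
                    w.print(text);
                }
                return Files.readAllBytes(file.toPath());
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Space-separated lowercase hex dump, e.g. "ff fe 20 00".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("UTF_16LE"));
        System.out.println(hex(write(" ", "UTF-16LE", false)));
        System.out.println(hex(write(" ", "UTF_16LE", false)));
        System.out.println(hex(write(" ", "UTF-16LE", true)));
    }
}
```

On a standard JDK I would expect this to print `UTF-16LE`, `20 00`, `20 00` and `ff fe 20 00`. If so, Java itself does resolve the alias, which would suggest that a writer deciding on the BOM by comparing the raw charset string, rather than the resolved `Charset`, is what misses "UTF_16LE". Whether Groovy's code actually works that way is an assumption here, not something this check proves.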
--001a113d304221f8140518109225--