Date: Tue, 9 Jun 2015 09:21:45 +0200
Subject: Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
From: Guillaume Laforge
To: users@groovy.incubator.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11352cde60672e0518109c70

--001a11352cde60672e0518109c70
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

From Groovy's point of view (i.e. when you're coding in Groovy), the BOM is automatically discarded when you use one of our reader methods (withReader, etc.), so it's transparent whether the BOM is there or not.

I tend to think that always having the BOM is a good thing (I even thought it was mandatory), but Groovy should guess the endianness regardless.
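
A minimal sketch, in plain Java, of what "discarding the BOM" and "guessing the endianness" could look like for UTF-16 input. It is not Groovy's actual reader implementation: the class and method names are hypothetical, and the only "guess" it makes when no BOM is present is the big-endian default from RFC 2781.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf16BomSniffer {

    /** True when the data starts with a UTF-16 BOM in either byte order. */
    static boolean hasBom(byte[] data) {
        if (data.length < 2) {
            return false;
        }
        int b0 = data[0] & 0xFF;
        int b1 = data[1] & 0xFF;
        return (b0 == 0xFF && b1 == 0xFE) || (b0 == 0xFE && b1 == 0xFF);
    }

    /** Picks the byte order from the BOM; without a BOM, assumes big-endian. */
    static Charset detect(byte[] data) {
        if (hasBom(data) && (data[0] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16LE; // FF FE
        }
        return StandardCharsets.UTF_16BE;     // FE FF, or no BOM at all
    }

    /** Decodes the text and drops the BOM so callers never see it. */
    static String decode(byte[] data) {
        int offset = hasBom(data) ? 2 : 0;
        return new String(data, offset, data.length - offset, detect(data));
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x00};
        byte[] withoutBom = {0x20, 0x00};
        System.out.println((int) decode(withBom).charAt(0));    // 32   (U+0020)
        System.out.println((int) decode(withoutBom).charAt(0)); // 8192 (U+2000)
    }
}
```

The second call shows why a missing BOM matters: the same two bytes that mean a space in little-endian decode to a different character under the big-endian default.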

Happy to hear what others think too about all this though.

Guillaume


2015-06-08 23:20 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
The code as-is today writes the BOM regardless of platform. I just tested in Linux with the same results. I think there are 2 parts to the question of "what's the correct behavior?"

1. Should the BOM be written at all, particularly when the platform is Windows?
2. Should the behavior of withPrintWriter differ (even if the difference is to be smarter) from the behavior of new PrintWriter?

Discussion
1. Strictly speaking, yes, because RFC 2781 states in section 4.3 that big-endian should be assumed if there is no BOM. However, in practice, many applications disregard the RFC and assume little-endian because that's what Windows does. Because of this, the behavior could be changed so that when writing UTF-16LE on Windows, it doesn't write the BOM. But in my opinion, it's best practice to always write a BOM when working with UTF-16, and Java should have done this in its implementation of PrintWriter.

2. This is a tough one. Arguably, withPrintWriter is doing the smarter, more correct thing, but the typical user would assume it is just a shorthand convenience for newing up a PrintWriter (I certainly did). So the question is: is it better to just document this difference in the GroovyDoc, or to change the behavior to be closer to Java? And if the latter, what breakages would that cause within Groovy itself? Making that change could break folks in production, because they could rely on that BOM being there, for example where the file is created on Windows but then processed on Linux, or when working with a third-party library that is more picky about the presence of a BOM.
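
To make point 1 concrete, here is a small Java sketch of how a decoder that follows the RFC's big-endian default treats BOM-less little-endian data. It relies on the JDK's generic "UTF-16" charset, which is documented to default to big-endian when no BOM is present. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class Utf16DefaultOrder {

    /** Decodes with the generic UTF-16 charset (big-endian unless a BOM says otherwise). */
    static String asGenericUtf16(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    /** Decodes with an explicitly little-endian charset. */
    static String asLittleEndian(byte[] data) {
        return new String(data, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // A single space written as UTF-16LE without a BOM.
        byte[] noBom = {0x20, 0x00};
        System.out.printf("UTF-16:   U+%04X%n", (int) asGenericUtf16(noBom).charAt(0)); // U+2000
        System.out.printf("UTF-16LE: U+%04X%n", (int) asLittleEndian(noBom).charAt(0)); // U+0020
    }
}
```

A consumer that does not already know the file is little-endian therefore reads the wrong character, which is the kind of breakage a cross-platform or third-party reader could hit if the BOM were dropped.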

-Keegan

On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <glaforge@gmail.com> wrote:
Now... whether that is what should be done is the good question to ask :-)
Does Windows manage to open UTF-16 files without BOMs?

2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
I forgot to mention that. Yes, I ran the test mentioned in Windows.

On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <glaforge@gmail.com> wrote:
That's a good question.
I guess this is happening on Windows? (I haven't tried here, since I'm on OS X)
I think BOMs were mandatory in text files on Windows.

2015-06-08 17:53 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
I've always taken a perverse pleasure in character encoding problems. I was intrigued by this SO question on UTF-16 BOMs in Java vs Groovy.

It appears that using withPrintWriter(charset) produces a BOM, whereas new PrintWriter(file, charset) does not, as demonstrated here:

File file = new File("tmp.txt")
try {
    String text = " "
    String charset = "UTF-16LE"

    file.withPrintWriter(charset) { it << text }
    println "withPrintWriter"
    file.getBytes().each { System.out.format("%02x ", it) }

    PrintWriter w = new PrintWriter(file, charset)
    w.print(text)
    w.close()
    println "\n\nnew PrintWriter"
    file.getBytes().each { System.out.format("%02x ", it) }
} finally {
    file.delete()
}

Outputs

withPrintWriter
ff fe 20 00

new PrintWriter
20 00
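
Both byte sequences above can be reproduced in plain Java, which may help when comparing the two behaviors. The sketch below assumes the BOM is written by hand as the character U+FEFF before the text. It does not claim to show how withPrintWriter is implemented, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class BomWriterSketch {

    /** Encodes text as UTF-16LE, optionally preceded by an explicit BOM (U+FEFF). */
    static byte[] write(String text, boolean withBom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (PrintWriter w = new PrintWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_16LE))) {
            if (withBom) {
                w.write('\uFEFF'); // encoded as FF FE in little-endian
            }
            w.print(text);
        }
        return out.toByteArray();
    }

    /** Formats bytes the same way as the Groovy test: two hex digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(write(" ", true)));  // ff fe 20 00
        System.out.println(hex(write(" ", false))); // 20 00
    }
}
```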

Is this difference in behavior intentional? It seems kinda odd to me.

-Keegan



--
Guillaume Laforge
Groovy Project Manager
Product Ninja & Advocate at Restlet





--001a11352cde60672e0518109c70--