groovy-users mailing list archives

From Guillaume Laforge <glafo...@gmail.com>
Subject Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
Date Tue, 09 Jun 2015 07:18:53 GMT
For that point, perhaps it's a limitation of Java itself not recognizing
that alias?
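
A quick way to check is to ask Java directly. This is only a sketch: the class and method names (AliasCheck, canonicalName) are made up for illustration, and I haven't verified how Groovy compares the charset name internally. If Java does resolve the alias to the canonical name, the limitation would sit with whatever code compares the raw string instead of the resolved charset.

```java
import java.nio.charset.Charset;

public class AliasCheck {
    // Returns the canonical name Java resolves a charset name or alias to.
    // Throws UnsupportedCharsetException if Java does not know the name.
    public static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        String alias = "UTF_16LE";
        System.out.println("supported: " + Charset.isSupported(alias));
        System.out.println("canonical: " + canonicalName(alias));
        // A plain string comparison against the canonical name ignores aliases.
        System.out.println("raw equals: " + alias.equals(canonicalName(alias)));
    }
}
```

Comparing on Charset.forName(name).name() rather than on the string the caller passed in would treat an alias and its canonical name the same way.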

2015-06-08 23:41 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:

> Another point of interest is that the current code doesn't respect
> aliases.  For example, the charset string "UTF_16LE" will not write the
> BOM, despite being an alias for "UTF-16LE".
>
> -Keegan
> On Jun 8, 2015 5:20 PM, "Keegan Witt" <keeganwitt@gmail.com> wrote:
>
>> The code as-is today writes the BOM regardless of platform.  I just
>> tested in Linux with the same results.  I think there are 2 parts to the
>> question of "what's the correct behavior?"
>>
>> 1.  Should the BOM be written at all, particularly when the platform is
>> Windows?
>> 2.  Should the behavior of *withPrintWriter* differ (even if the
>> difference is to be smarter) from the behavior of *new PrintWriter*?
>>
>> *Discussion*
>> 1.  Strictly speaking, yes, because RFC 2781
>> <http://tools.ietf.org/html/rfc2781> states in section 4.3 that text should be
>> assumed big-endian if there is no BOM.  However, in practice, many applications
>> disregard the RFC and assume little-endian because that's what Windows
>> does
>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>> Because of this, the behavior could be changed so that when writing
>> UTF-16LE on Windows, it doesn't write the BOM.  But in my opinion, it's
>> best practice to always write a BOM when working with UTF-16, and Java
>> should have done this in their implementation of their PrintWriter.
>>
>> 2.  This is a tough one.  Arguably, *withPrintWriter* is doing the
>> smarter, more correct behavior, but the typical user would assume this is
>> just a shorthand convenience for newing up a PrintWriter (I certainly
>> did).  So the question is, is it better to just document this difference in
>> the GroovyDoc?  Or to change the behavior to be closer to Java?  And if the
>> latter, what breakages would that cause within Groovy itself?  Making that
>> change could break folks in production, because they could rely on that BOM
>> being there, in cases for example where the file is created on Windows, but
>> then processed on Linux or when working with a third party library that is
>> more picky about the presence of a BOM.
>>
>> -Keegan
>>
>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <glaforge@gmail.com>
>> wrote:
>>
>>> Now... whether that's what should be done is the real question to ask :-)
>>> Does Windows manage to open UTF-16 files without BOMs?
>>>
>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>
>>>> I forgot to mention that.  Yes, I ran the test mentioned in Windows.
>>>>
>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <glaforge@gmail.com>
>>>> wrote:
>>>>
>>>>> That's a good question.
>>>>> I guess this is happening on Windows? (I haven't tried here, since I'm
>>>>> on OS X)
>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>
>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>
>>>>>> I've always taken a perverse pleasure in character encoding
>>>>>> problems.  I was intrigued by this SO question
>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
>>>>>> on UTF-16 BOMs in Java vs Groovy.
>>>>>>
>>>>>> It appears using withPrintWriter(charset) produces a BOM whereas
>>>>>> new PrintWriter(file, charset) does not.  As demonstrated here:
>>>>>>
>>>>>> File file = new File("tmp.txt")
>>>>>> try {
>>>>>>     String text = " "
>>>>>>     String charset = "UTF-16LE"
>>>>>>
>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>     println "withPrintWriter"
>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>
>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>     w.print(text)
>>>>>>     w.close()
>>>>>>     println "\n\nnew PrintWriter"
>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>> } finally {
>>>>>>     file.delete()
>>>>>> }
>>>>>>
>>>>>> Outputs
>>>>>>
>>>>>> withPrintWriter
>>>>>> ff fe 20 00
>>>>>>
>>>>>> new PrintWriter
>>>>>> 20 00
>>>>>>
>>>>>>
>>>>>> Is this difference in behavior intentional?  It seems kinda odd to
>>>>>> me.
>>>>>>
>>>>>> -Keegan
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Guillaume Laforge
>>>>> Groovy Project Manager
>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>
>>>>> Blog: http://glaforge.appspot.com/
>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
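
For reference, the plain-Java side of the comparison above can be reproduced without touching the file system. This is a rough sketch, not Groovy's implementation: BomSketch, hex and writeUtf16Le are names invented here, and writing U+FEFF by hand is just one way to get a BOM out of a PrintWriter.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class BomSketch {
    // Formats bytes as space-separated two-digit hex, like the test script in the thread.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    // Writes text as UTF-16LE through a PrintWriter, optionally preceded by U+FEFF.
    public static byte[] writeUtf16Le(String text, boolean withBom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter w = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_16LE));
        if (withBom) {
            w.write('\uFEFF'); // encoded as ff fe in little-endian order
        }
        w.print(text);
        w.close();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(hex(writeUtf16Le(" ", false)));              // 20 00
        System.out.println(hex(writeUtf16Le(" ", true)));               // ff fe 20 00
        System.out.println(hex(" ".getBytes(StandardCharsets.UTF_16))); // fe ff 00 20
    }
}
```

The last line shows that Java does emit a BOM for the generic "UTF-16" charset (big-endian), while the endian-specific "UTF-16LE" encoder writes none.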


-- 
Guillaume Laforge
Groovy Project Manager
Product Ninja & Advocate at Restlet <http://restlet.com>

Blog: http://glaforge.appspot.com/
Social: @glaforge <http://twitter.com/glaforge> / Google+
<https://plus.google.com/u/0/114130972232398734985/posts>
