groovy-users mailing list archives

From Guillaume Laforge <glafo...@gmail.com>
Subject Re: UTF16 BOM in new PrintWriter() vs withPrintWriter()
Date Tue, 09 Jun 2015 19:22:34 GMT
2015-06-09 18:57 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:

> I created PR 37 <https://github.com/apache/incubator-groovy/pull/37> to
> correct the JavaDoc I mentioned (as well as to document the existing
> behavior for the non-NIO methods).
>
> Java doesn't eat the BOM, but this is a problem Java folks are used to
> dealing with, and why things like Apache Commons IO's BOMInputStream
> <https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html>
> exist.
>

That's why I made Groovy eat the BOM too, so that it's transparent to
our users :-)
But it's been a long time since I worked on those parts of the codebase,
and they've been refactored quite a bit since (by Jim, for example).
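
For anyone following along, here is a minimal Java sketch of the point above. It makes no claim about Groovy's actual implementation: the class name BomSkippingDemo and the helper skipBom are made up for illustration. It shows that a plain InputStreamReader opened as UTF-16LE hands the BOM back as the character U+FEFF, and one possible way a reader could drop it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class BomSkippingDemo {

    // Drains the reader into a String, one char at a time.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Hypothetical helper (not a Groovy or JDK API): wraps the reader and
    // discards a leading U+FEFF if one is present.
    static Reader skipBom(Reader reader) throws IOException {
        PushbackReader pushback = new PushbackReader(reader, 1);
        int first = pushback.read();
        if (first != -1 && first != 0xFEFF) {
            pushback.unread(first); // not a BOM, put it back
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // The bytes produced by withPrintWriter in the original report: BOM + space.
        byte[] bytes = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x00};

        String raw = readAll(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_16LE));
        String stripped = readAll(skipBom(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_16LE)));

        System.out.println(raw.length());      // 2: U+FEFF followed by the space
        System.out.println(stripped.length()); // 1: just the space
    }
}
```

A PushbackReader is used here only because it keeps the sketch short; wrapping the InputStream instead (as Commons IO's BOMInputStream does) is an equally reasonable design.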


>
> -Keegan
>
> On Tue, Jun 9, 2015 at 11:33 AM, Guillaume Laforge <glaforge@gmail.com>
> wrote:
>
>> So now, how to decide what's best? :-)
>>
>> Is a Java reader happy with the BOM, and does it eat it transparently? (I
>> think in the past that wasn't the case, but I may be wrong.)
>>
>> 2015-06-09 17:21 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>
>>> That's an excellent point, Paolo.  NioGroovyMethods.newWriter claims
>>> (in the JavaDoc) it will write the BOM if needed, but it doesn't because it
>>> uses Java's implementation rather than Groovy's
>>> writeUTF16BomIfRequired.  None of the methods in NioGroovyMethods use
>>> writeUTF16BomIfRequired.
>>>
>>> Whichever we decide, we should be consistent.
>>>
>>> -Keegan
>>>
>>> On Tue, Jun 9, 2015 at 11:08 AM, Paolo Di Tommaso <
>>> paolo.ditommaso@gmail.com> wrote:
>>>
>>>> I'm wondering if NioGroovyMethods that implement the write methods for
>>>> Path should do the same.
>>>>
>>>>
>>>> Cheers,
>>>> Paolo
>>>>
>>>>
>>>> On Tue, Jun 9, 2015 at 4:02 PM, Keegan Witt <keeganwitt@gmail.com>
>>>> wrote:
>>>>
>>>>> Cool.  I'll wait for PR 36 to be merged first, because I also was
>>>>> thinking the Javadoc would be changed from
>>>>>     is "UTF-16BE" or "UTF-16LE"
>>>>> to
>>>>>     is "UTF-16BE" or "UTF-16LE" (or an equivalent alias)
>>>>>
>>>>> -Keegan
>>>>>
>>>>>
>>>>> On Tue, Jun 9, 2015 at 9:08 AM, Guillaume Laforge <glaforge@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> 2015-06-09 15:04 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>
>>>>>>> Created GROOVY-7461
>>>>>>> <https://issues.apache.org/jira/browse/GROOVY-7461> and PR 36
>>>>>>> <https://github.com/apache/incubator-groovy/pull/36>.
>>>>>>>
>>>>>>
>>>>>> Cool!
>>>>>>
>>>>>>
>>>>>>> How would you feel about a PR to copy the Javadoc comment mentioning
>>>>>>> the UTF-16 BOM on File.newWriter to all the other methods that use
>>>>>>> writeUTF16BomIfRequired (at least until we decide we're going to
>>>>>>> change the current behavior)?
>>>>>>>
>>>>>>
>>>>>> Right, worth it!
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> -Keegan
>>>>>>>
>>>>>>> On Tue, Jun 9, 2015 at 8:17 AM, Guillaume Laforge <
>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>
>>>>>>>> Good point!
>>>>>>>>
>>>>>>>> 2015-06-09 14:11 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>
>>>>>>>>> That's only available in Java 7.  Isn't Groovy still targeting 1.6
>>>>>>>>> for the non-indy version?
>>>>>>>>>
>>>>>>>>> -Keegan
>>>>>>>>> On Jun 9, 2015 7:56 AM, "Guillaume Laforge" <glaforge@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Well spotted!
>>>>>>>>>>
>>>>>>>>>> You could also compare with the StandardCharsets constants, instead
>>>>>>>>>> of going through the name comparison:
>>>>>>>>>>
>>>>>>>>>> http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
>>>>>>>>>>
>>>>>>>>>> 2015-06-09 13:49 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> No, it's a Groovy bug.
>>>>>>>>>>>
>>>>>>>>>>> private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
>>>>>>>>>>>     if ("UTF-16BE".equals(charset)) {
>>>>>>>>>>>         writeUtf16Bom(stream, true);
>>>>>>>>>>>     } else if ("UTF-16LE".equals(charset)) {
>>>>>>>>>>>         writeUtf16Bom(stream, false);
>>>>>>>>>>>     }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> should be
>>>>>>>>>>>
>>>>>>>>>>> private static void writeUTF16BomIfRequired(final String charset, final OutputStream stream) throws IOException {
>>>>>>>>>>>     if ("UTF-16BE".equals(Charset.forName(charset).name())) {
>>>>>>>>>>>         writeUtf16Bom(stream, true);
>>>>>>>>>>>     } else if ("UTF-16LE".equals(Charset.forName(charset).name())) {
>>>>>>>>>>>         writeUtf16Bom(stream, false);
>>>>>>>>>>>     }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> in org.codehaus.groovy.runtime.ResourceGroovyMethods.  We'll
>>>>>>>>>>> probably want to fix that regardless of what we decide on the
>>>>>>>>>>> *withPrintWriter* question.  I'll open a Jira and a PR.
>>>>>>>>>>>
>>>>>>>>>>> -Keegan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 9, 2015 at 3:21 AM, Guillaume Laforge <
>>>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From Groovy's point of view (i.e. when you're coding in Groovy),
>>>>>>>>>>>> the BOM is automatically discarded when you use one of our reader
>>>>>>>>>>>> methods (withReader, etc.), so it's transparent whether the BOM is
>>>>>>>>>>>> there or not.
>>>>>>>>>>>>
>>>>>>>>>>>> I tend to think that always having the BOM is a good thing (I
>>>>>>>>>>>> even thought that was mandatory), but Groovy should guess the
>>>>>>>>>>>> endianness regardless anyway.
>>>>>>>>>>>>
>>>>>>>>>>>> Happy to hear what others think about all this too, though.
>>>>>>>>>>>>
>>>>>>>>>>>> Guillaume
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-06-08 23:20 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> The code as-is today writes the BOM regardless of platform.  I
>>>>>>>>>>>>> just tested in Linux with the same results.  I think there are 2
>>>>>>>>>>>>> parts to the question of "what's the correct behavior?"
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1.  Should the BOM be written at all, particularly when the
>>>>>>>>>>>>> platform is Windows?
>>>>>>>>>>>>> 2.  Should the behavior of *withPrintWriter* differ (even if
>>>>>>>>>>>>> the difference is to be smarter) from the behavior of *new
>>>>>>>>>>>>> PrintWriter*?
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Discussion*
>>>>>>>>>>>>> 1.  Strictly speaking, yes, because RFC 2781
>>>>>>>>>>>>> <http://tools.ietf.org/html/rfc2781> states in section 4.3 to
>>>>>>>>>>>>> assume big endian if there is no BOM.  However, in practice, many
>>>>>>>>>>>>> applications disregard the RFC and assume little-endian because
>>>>>>>>>>>>> that's what Windows does
>>>>>>>>>>>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx>.
>>>>>>>>>>>>> Because of this, the behavior could be changed so that when
>>>>>>>>>>>>> writing UTF-16LE on Windows, it doesn't write the BOM.  But in my
>>>>>>>>>>>>> opinion, it's best practice to always write a BOM when working
>>>>>>>>>>>>> with UTF-16, and Java should have done this in its PrintWriter
>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2.  This is a tough one.  Arguably, *withPrintWriter* has the
>>>>>>>>>>>>> smarter, more correct behavior, but the typical user would assume
>>>>>>>>>>>>> it is just a shorthand convenience for newing up a PrintWriter (I
>>>>>>>>>>>>> certainly did).  So the question is, is it better to just document
>>>>>>>>>>>>> this difference in the GroovyDoc?  Or to change the behavior to be
>>>>>>>>>>>>> closer to Java?  And if the latter, what breakages would that
>>>>>>>>>>>>> cause within Groovy itself?  Making that change could break folks
>>>>>>>>>>>>> in production, because they could rely on that BOM being there,
>>>>>>>>>>>>> for example where the file is created on Windows but then
>>>>>>>>>>>>> processed on Linux, or when working with a third-party library
>>>>>>>>>>>>> that is pickier about the presence of a BOM.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Keegan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 8, 2015 at 4:32 PM, Guillaume Laforge <
>>>>>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now... whether that's what should be done is the right
>>>>>>>>>>>>>> question to ask :-)
>>>>>>>>>>>>>> Does Windows manage to open UTF-16 files without BOMs?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2015-06-08 22:17 GMT+02:00 Keegan Witt <keeganwitt@gmail.com>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I forgot to mention that.  Yes, I ran the test mentioned in
>>>>>>>>>>>>>>> Windows.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jun 8, 2015 at 3:54 PM, Guillaume Laforge <
>>>>>>>>>>>>>>> glaforge@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That's a good question.
>>>>>>>>>>>>>>>> I guess this is happening on Windows? (I haven't tried
>>>>>>>>>>>>>>>> here, since I'm on OS X)
>>>>>>>>>>>>>>>> I think BOMs were mandatory in text files on Windows.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2015-06-08 17:53 GMT+02:00 Keegan Witt <
>>>>>>>>>>>>>>>> keeganwitt@gmail.com>:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've always taken a perverse pleasure in character
>>>>>>>>>>>>>>>>> encoding problems.  I was intrigued by this SO question
>>>>>>>>>>>>>>>>> <http://stackoverflow.com/questions/30538461/why-groovy-file-write-with-utf-16le-produce-bom-char>
>>>>>>>>>>>>>>>>> on UTF-16 BOMs in Java vs Groovy.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It appears using withPrintWriter(charset) produces a BOM
>>>>>>>>>>>>>>>>> whereas new PrintWriter(file, charset) does not, as
>>>>>>>>>>>>>>>>> demonstrated here:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> File file = new File("tmp.txt")
>>>>>>>>>>>>>>>>> try {
>>>>>>>>>>>>>>>>>     String text = " "
>>>>>>>>>>>>>>>>>     String charset = "UTF-16LE"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     file.withPrintWriter(charset) { it << text }
>>>>>>>>>>>>>>>>>     println "withPrintWriter"
>>>>>>>>>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     PrintWriter w = new PrintWriter(file, charset)
>>>>>>>>>>>>>>>>>     w.print(text)
>>>>>>>>>>>>>>>>>     w.close()
>>>>>>>>>>>>>>>>>     println "\n\nnew PrintWriter"
>>>>>>>>>>>>>>>>>     file.getBytes().each { System.out.format("%02x ", it) }
>>>>>>>>>>>>>>>>> } finally {
>>>>>>>>>>>>>>>>>     file.delete()
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Outputs
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> withPrintWriter
>>>>>>>>>>>>>>>>> ff fe 20 00
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> new PrintWriter
>>>>>>>>>>>>>>>>> 20 00
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is this difference in behavior intentional?  It seems
>>>>>>>>>>>>>>>>> kinda odd to me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Keegan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Guillaume Laforge
>>>>>>>>>>>>>>>> Groovy Project Manager
>>>>>>>>>>>>>>>> Product Ninja & Advocate at Restlet <http://restlet.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Blog: http://glaforge.appspot.com/
>>>>>>>>>>>>>>>> Social: @glaforge <http://twitter.com/glaforge> / Google+
>>>>>>>>>>>>>>>> <https://plus.google.com/u/0/114130972232398734985/posts>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>
>
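
To illustrate the alias fix quoted earlier in the thread (resolving the name through Charset.forName(...).name() before comparing), here is a minimal Java sketch. It is not the Groovy source: the class Utf16BomSketch and the method writeBomIfUtf16 are hypothetical names, and the BOM bytes are written inline rather than through writeUtf16Bom.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

public class Utf16BomSketch {

    // Hypothetical stand-in for writeUTF16BomIfRequired: resolves the requested
    // name to its canonical form once, so aliases and different casing match.
    // Charset.forName throws an unchecked exception for unknown names.
    static void writeBomIfUtf16(String charsetName, OutputStream out) throws IOException {
        String canonical = Charset.forName(charsetName).name();
        if ("UTF-16BE".equals(canonical)) {
            out.write(0xFE); // big-endian BOM: FE FF
            out.write(0xFF);
        } else if ("UTF-16LE".equals(canonical)) {
            out.write(0xFF); // little-endian BOM: FF FE
            out.write(0xFE);
        }
        // Any other charset: write nothing.
    }

    public static void main(String[] args) throws IOException {
        for (String name : new String[] {"UTF-16LE", "utf-16le", "UTF-16BE", "UTF-8"}) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writeBomIfUtf16(name, out);
            StringBuilder hex = new StringBuilder();
            for (byte b : out.toByteArray()) {
                hex.append(String.format("%02x ", b));
            }
            System.out.println(name + " -> " + hex.toString().trim());
        }
    }
}
```

With a plain string comparison, "utf-16le" would fall through both branches and no BOM would be written; resolving the canonical name first avoids that and works on Java 6, since it does not need StandardCharsets.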


-- 
Guillaume Laforge
Groovy Project Manager
Product Ninja & Advocate at Restlet <http://restlet.com>

Blog: http://glaforge.appspot.com/
Social: @glaforge <http://twitter.com/glaforge> / Google+
<https://plus.google.com/u/0/114130972232398734985/posts>
