subversion-users mailing list archives

From Branko Čibej <>
Subject Re: SVN Blame Returns Corrupt Data
Date Fri, 11 Oct 2013 16:22:33 GMT
On 11.10.2013 18:12, Bob Archer wrote:
>> On 11.10.2013 17:19, Bob Archer wrote:
>>>> On 11.10.2013 16:55, Bob Archer wrote:
>>>>>> On 11.10.2013 15:58, Bob Archer wrote:
>>>>>>>> On Thu, Oct 10, 2013 at 5:49 PM, Bob Archer <> wrote:
>>>>>>>> I assume he was asking how to "fix" the blame. Cause, sure,
>>>>>>>> could open the file, convert it back to UTF-8 with CRLF line
>>>>>>>> endings... and commit it... of course, now blame is going to show
>>>>>>>> him on every line, since he just changed every line.
>>>>>>>> That's exactly what I meant.  You're correct with how the
>>>>>>>> is handled.  I committed the UTF-8 copy to a test branch,
>>>>>>>> and it showed every line as being changed.  Unfortunately
>>>>>>>> looks like this is our best option.
>>>>>>> Yep, we have done the same thing. As a matter of fact, just in
>>>>>>> the past few days I rescripted all our database scripts to be UTF-8,
>>>>>>> since merging them just doesn't work correctly when they are UTF-16,
>>>>>>> even if you remove the binary mime type.
>>>>>>>> On Thu, Oct 10, 2013 at 7:07 PM, Ben Reser <>
>>>>>>>> At current blame is not UTF-16 aware.
>>>>>>> It's not just blame that isn't... the diff engine, or whatever
>>>>>>> detects file types, always considers UTF-16 files to be binary. If
>>>>>>> you "add" a UTF-16 file you see that svn adds the
>>>>>>> application/octet-stream mime type.  There is an issue in the bug
>>>>>>> database about this from when I reported/complained about it...
>>>>>>> however it hasn't been addressed. I'm surprised that at this time
>>>>>>> svn still can't support UTF-16 text files as text wrt adding,
>>>>>>> diffing, blaming, etc.
>>>>>> It's quite simple: no-one has written the necessary code. While I
>>>>>> can understand it's an interesting feature for Windows users, most
>>>>>> Subversion developers have other things to do. This being a
>>>>>> volunteer project, and most of us do not use Windows, you can
>>>>>> hardly expect anyone to spend several weeks on solving a problem
>>>>>> that has a perfectly simple workaround. Since
>>>>>> UTF-8 and UTF-16 can be interchanged without data loss, there are
>>>>>> other, much more important things to do in Subversion.
>>>>> I appreciate all that you said. I didn't expect that UTF-16 was so
>>>>> uncommon in non-Windows OSes. A large number of dev tools that I work
>>>>> with on Windows, especially the Microsoft tools, default to creating
>>>>> UTF-16 files.
>>>>> I disagree with your "can be converted without data loss". If you
>>>>> need UTF-16 then you need it. Also, if you are working in an
>>>>> international team and you have developers with other language OSes
>>>>> which have different code pages, then what you see when you look at a
>>>>> UTF-8 file might be different than what I see.
>>>> I don't follow. Both UTF-16 and UTF-8 are complete representations of
>>>> the Unicode character set. Exactly the same code sequences can be
>>>> represented in both encodings. You can convert from UTF-16 to UTF-8
>>>> and back and get exactly the same sequence of bytes.
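
A minimal sketch of that round trip in Python, for illustration only. The
function name is made up for this example; it assumes the input is valid
UTF-16 text (no unpaired surrogates) and uses an explicit byte order so that
no BOM is added or dropped along the way.

```python
def utf16_to_utf8_and_back(utf16_bytes: bytes, encoding: str = "utf-16-le") -> bytes:
    """Convert UTF-16 bytes to UTF-8 and back again (hypothetical helper)."""
    text = utf16_bytes.decode(encoding)      # UTF-16 bytes -> code points
    utf8_bytes = text.encode("utf-8")        # code points -> UTF-8 bytes
    return utf8_bytes.decode("utf-8").encode(encoding)  # and back to UTF-16


# Non-ASCII, non-BMP and mixed line endings all survive the round trip.
sample = "Branko \u010cibej \U0001F600\r\nsecond line\n".encode("utf-16-le")
assert utf16_to_utf8_and_back(sample) == sample
```
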
>>> Ok, I have to back pedal here a bit.  You are correct, UTF-8 is a Unicode
>>> format and can store all characters. It's not a UTF-8 vs UTF-16 issue
>>> (Friday senior moment). What I recall being told by one of the subversion
>>> developers was that subversion only supported the ASCII character set and
>>> while UTF-8 was compatible with ASCII it didn't truly support Unicode files.
>>> However, this blog entry seems to dispute that:
>>> Would adding that mime-type to this file fix the blame issues this user is
>>> seeing?
>> I think the user is just very lucky. Subversion does not actually try to interpret
>> the svn:mime-type property, other than to determine whether to treat a file
>> as text or binary. (The comment is correct in that the proper parameter is
>> charset=, not encoding=, but that's not important for this discussion).
>> Subversion's merge algorithm depends on being able to detect line endings
>> in the file, and always scans the file as a sequence of bytes.
>> There are several ways to represent line endings in a UTF-16 file (shown here
>> as hex byte sequences):
>>   * 00 0A (Unix newline, UTF16-BE)
>>   * 00 0D 00 0A (Windows newline, UTF16-BE)
>>   * 0A 00 (Unix newline, UTF16-LE)
>>   * 0D 00 0A 00 (Windows newline, UTF16-LE)
>>   * 24 24 (Unicode newline, same in LE and BE)
>> Subversion, however, expects one of the following newline sequences:
>>   * 0A (Unix newline)
>>   * 0D 0A (Windows newline)
>> My best guess as to what's happening is that the 0A bytes, a.k.a. the ASCII
>> newline character, are interpreted as the end-of-line markers, and the zero
>> bytes are treated as part of the text. In most cases, the result will be close to
>> correct, as long as there are no conflicts in the merge -- because Subversion
>> will not emit conflict markers in UTF-16.
>> Of course, if someone used the U+2424 newline code point instead, then in
>> the worst case, the whole file would be interpreted as a single line.
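
To make the guess above concrete, here is a small Python sketch of a scanner
that only knows the single byte 0A as a line terminator. It is not
Subversion's code; the function name is invented for this illustration.

```python
from typing import List


def split_lines_bytewise(data: bytes) -> List[bytes]:
    """Split on the byte 0x0A only, keeping the terminator (illustrative)."""
    lines = []
    start = 0
    while True:
        idx = data.find(b"\n", start)
        if idx == -1:
            break
        lines.append(data[start:idx + 1])
        start = idx + 1
    if start < len(data):
        lines.append(data[start:])  # trailing bytes without a terminator
    return lines


text = "ab\r\ncd\r\n"
le_chunks = split_lines_bytewise(text.encode("utf-16-le"))
be_chunks = split_lines_bytewise(text.encode("utf-16-be"))
one_chunk = split_lines_bytewise("a\u2424b\u2424".encode("utf-16-le"))
```

With UTF16-BE the two chunks happen to line up with the real lines. With
UTF16-LE the zero byte that belongs to each newline ends up at the start of
the following chunk, so there are three chunks and the last one is a lone
00 byte. The U+2424 text contains no 0A byte at all and comes back as a
single chunk.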
>> -- Brane
> Great information.. thanks for that.
> Bottom line is use UTF-8 for your text files and svn will be happy and work correctly.
> How hard would it be to create a warning on an add that a file looks like UTF-16 and should
> be converted to UTF-8 otherwise it will be treated as a binary file?

You'd have to extend Subversion's file type detection to detect UTF-16.
See svn_io_detect_mimetype2 in line 3333 in this file:
Subversion currently looks only at the first 1 KB of a file. It may
be enough to check that this initial part of the file contains only
valid UTF-16 (BE or LE) code units. Note that the function takes a
dictionary of (file extension, MIME type) pairs, and if it finds a
matching type it doesn't look at the file contents at all. That may not
be quite correct here, given that no special file extension flags a
file as containing UTF-16.
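
As a rough illustration of such a check (a Python sketch, not a patch for
svn_io_detect_mimetype2; the function name and the heuristics are
assumptions made for this example): a bare validity test is not enough on
its own, because almost any even-length run of ASCII bytes also decodes as
valid UTF-16. The sketch therefore requires either a BOM or a lopsided
distribution of zero bytes before it tries to decode the sample.

```python
import codecs
from typing import Optional


def sniff_utf16(block: bytes) -> Optional[str]:
    """Guess whether a leading sample (e.g. the first 1024 bytes) is UTF-16.

    Returns "utf-16-le", "utf-16-be", or None. Heuristic sketch only.
    """
    if block.startswith(codecs.BOM_UTF16_LE):
        encoding, body = "utf-16-le", block[2:]
    elif block.startswith(codecs.BOM_UTF16_BE):
        encoding, body = "utf-16-be", block[2:]
    else:
        body = block
        # Mostly-Latin UTF-16 text has its zero bytes at odd offsets (LE)
        # or at even offsets (BE). No imbalance means no guess.
        even_zeros = body[0::2].count(0)
        odd_zeros = body[1::2].count(0)
        if even_zeros == odd_zeros:
            return None
        encoding = "utf-16-le" if odd_zeros > even_zeros else "utf-16-be"

    body = body[: len(body) - len(body) % 2]  # sample may end mid code unit
    try:
        text = body.decode(encoding)
    except UnicodeDecodeError:
        try:
            # The sample may have been cut inside a surrogate pair.
            text = body[:-2].decode(encoding)
        except UnicodeDecodeError:
            return None
    if "\x00" in text:  # NUL characters suggest binary data or UTF-32
        return None
    return encoding
```

A caller would pass in only the first block of the file, for example
sniff_utf16(data[:1024]). Text with few ASCII-range characters and no BOM
would likely be missed by the zero-byte test, and the extension dictionary
shortcut described above is not addressed here at all.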

-- Brane

Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
