subversion-dev mailing list archives

From "Artem V. Navrotskiy" <boz...@yandex.ru>
Subject Re: Subversion binary file detection is look like broken
Date Sat, 04 Oct 2014 06:03:46 GMT
Hello,

03.10.2014 15:35, Stefan Sperling wrote:
> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>>     Hello,
>>
>>
>>
>>     The Subversion console client tries to detect binary files with this
>>     algorithm:
>>
>>      1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why
>>         not check whether the first N bytes are correct UTF-8?);
>>      2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a
>>         uniform distribution of bytes, the chance of no ZERO byte is
>>         (1 - 1 / 256) ^ 1024 = ~1.8%);
>>      3. File is BINARY if over 85% of the first 1024 bytes are characters
>>         not in the ranges 0x07-0x0D, 0x20-0x7F (in total we have 153
>>         "binary" byte values, ~60%).
>>
>>     This algorithm looks broken.
>>
> Can you suggest a better algorithm?
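
To make the discussion concrete, here is how I read the three rules quoted above, as a rough Python sketch. It is only an illustration of the description in this thread, not the actual Subversion C code; the names `looks_binary_current`, `is_text_byte`, `SAMPLE_SIZE` and `UTF8_BOM` are made up for the example.

```python
UTF8_BOM = b"\xef\xbb\xbf"
SAMPLE_SIZE = 1024  # only the first 1024 bytes are inspected


def is_text_byte(b: int) -> bool:
    """Bytes the heuristic treats as text: 0x07-0x0D and 0x20-0x7F."""
    return 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F


def looks_binary_current(data: bytes) -> bool:
    """Sketch of the heuristic as described in the quoted message."""
    block = data[:SAMPLE_SIZE]
    # Rule 1: an empty file, or a file holding only the UTF-8 BOM, is not binary.
    if not block or block == UTF8_BOM:
        return False
    # Rule 2: any ZERO byte in the sample means binary.
    if 0 in block:
        return True
    # Rule 3: binary if over 85% of the sampled bytes are outside the text ranges.
    binary_count = sum(1 for b in block if not is_text_byte(b))
    return binary_count * 100 > 85 * len(block)
```

With this reading, a short UTF-8 text that consists only of Cyrillic letters, for example `"привет".encode("utf-8")`, has no bytes in the "text" ranges at all, so rule 3 classifies it as binary.
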
About false positives:

 1. If a text file is detected as binary:
      * with "svn:auto-props = '*.txt = svn:eol-style=native'" the svn
        client refuses to add the file: svn:eol-style and
        svn:mime-type=application/octet-stream can't be set
        simultaneously.
        There is a workaround:
          o create an empty file;
          o run svn add on the empty file;
          o replace the empty file with the real data;
          o commit.
      * you can't diff and merge this file (Cannot display: file marked
        as a binary type.).
        You can't fix it, because you can't remove the svn:mime-type
        property in the last modified revision.
 2. If a binary file is detected as text:
      * svn diff and merge display unusable output.
        You can fix it in the current revision by setting the
        svn:mime-type property.

I think the false positive where a text file is detected as binary is the
more annoying one.


About file type detection:

 1. The file detection algorithm must be as simple as possible.
 2. If the first N bytes contain a ZERO byte, the file is binary.
 3. If the file is valid UTF-8, the file is text.
 4. If the file contains too many binary characters, the file is binary.
    I think the definitely binary characters are: 0x00-0x08, 0x0B, 0x0C,
    0x0E-0x1F, 0x7F (30 byte values, ~11.7% of the 256 possible).
    These characters are very rarely used in text files. Characters in the
    range 0x80-0xFF can be letters in some encodings.
    The comparison threshold should be significantly lower than the
    share of these characters in uniformly distributed (random) data.
    For example, if about 2.5% of the first N bytes are "binary"
    characters, the file is binary.
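
The four points above could look roughly like this in Python. This is only a sketch under my own assumptions: N = 1024, the 2.5% example as the threshold, and the names `looks_binary_proposed`, `is_valid_utf8`, `SAMPLE_SIZE`, `BINARY_THRESHOLD` and `BINARY_BYTES` are invented for the illustration. The incremental decoder is used so that a multi-byte UTF-8 sequence cut off at the end of the sample does not count as invalid.

```python
import codecs

SAMPLE_SIZE = 1024        # "N" from the text; the value is an assumption
BINARY_THRESHOLD = 0.025  # the 2.5% example from the text

# "Definitely binary" bytes: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F.
BINARY_BYTES = frozenset(
    list(range(0x00, 0x09)) + [0x0B, 0x0C] + list(range(0x0E, 0x20)) + [0x7F]
)


def is_valid_utf8(block: bytes, complete: bool) -> bool:
    """Check UTF-8 validity; tolerate a truncated tail unless `complete`."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(block, final=complete)
        return True
    except UnicodeDecodeError:
        return False


def looks_binary_proposed(data: bytes) -> bool:
    block = data[:SAMPLE_SIZE]
    if not block:
        return False
    # Point 2: a ZERO byte in the first N bytes means binary.
    if 0 in block:
        return True
    # Point 3: valid UTF-8 means text.
    if is_valid_utf8(block, complete=len(data) <= SAMPLE_SIZE):
        return False
    # Point 4: too many "definitely binary" bytes means binary.
    binary_count = sum(1 for b in block if b in BINARY_BYTES)
    return binary_count / len(block) >= BINARY_THRESHOLD
```

One consequence of this ordering: ASCII control characters are themselves valid UTF-8, so point 4 only ever applies to data that fails the UTF-8 check. Text in a single-byte encoding such as cp1251 is not valid UTF-8, but it contains none of the listed bytes and so stays text.
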


Overall, the following implementations look good to me:

 1. Git autodetection: if the first 8000 bytes contain a ZERO byte, the
    file is binary.
    + As simple as possible;
    + Can't detect text files as binary;
    - Can detect some binary files as text;
 2. Byte range autodetection: if the first N bytes contain a byte from the
    range 0x00-0x08 or 0x0E-0x1F, the file is binary.
    + Still simple;
    - Can detect some short binary files as text;
 3. Byte range autodetection: if at least some threshold percentage of the
    first N bytes are bytes from 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F,
    the file is binary.
    - Not so simple;
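
As an illustration of how little code the first two variants need, here is a minimal Python sketch. The function names, `GIT_SAMPLE_SIZE`, `CONTROL_BYTES` and the default N = 1024 are assumptions for the example; this is not code taken from Git or Subversion.

```python
GIT_SAMPLE_SIZE = 8000  # variant 1 looks at the first 8000 bytes

# Variant 2: bytes from 0x00-0x08 or 0x0E-0x1F.
CONTROL_BYTES = frozenset(list(range(0x00, 0x09)) + list(range(0x0E, 0x20)))


def git_style_is_binary(data: bytes) -> bool:
    """Variant 1: binary if the sample contains a ZERO byte."""
    return 0 in data[:GIT_SAMPLE_SIZE]


def byte_range_is_binary(data: bytes, sample_size: int = 1024) -> bool:
    """Variant 2: binary if the sample contains any byte from CONTROL_BYTES."""
    return any(b in CONTROL_BYTES for b in data[:sample_size])
```

Both checks stop at the sample boundary, so a ZERO byte that only appears after the first 8000 bytes is not seen by the first variant.
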



Best regards,
Navrotskiy Artem.
