Mailing-List: contact dev-help@subversion.apache.org; run by ezmlm
Precedence: bulk
Received-SPF: pass (athena.apache.org: domain of bozaro@yandex.ru designates
 84.201.143.140 as permitted sender)
Message-ID: <542F8DC2.3000000@yandex.ru>
Date: Sat, 04 Oct 2014 10:03:46 +0400
From: "Artem V. Navrotskiy" <bozaro@yandex.ru>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.2
MIME-Version: 1.0
To: dev@subversion.apache.org
Subject: Re: Subversion binary file detection is look like broken
References: <1048611412321192@web20g.yandex.ru>
 <20141003113501.GF1256@ted.stsp.name>
In-Reply-To: <20141003113501.GF1256@ted.stsp.name>
Content-Type: multipart/alternative;
 boundary="------------090109040208050506020809"

This is a multi-part message in MIME format.
--------------090109040208050506020809
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Hello,

03.10.2014 15:35, Stefan Sperling wrote:
> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>>     Hello,
>>
>>
>>
>>     Subversion console client try to detect binary file with algorythm:
>>
>>      1. File is NOT BINARY if it contains only BOM UTF-8 signature (why not
>>         check as first N bytes is corret UTF-8?);
>>      2. File is BINARY if first 1024 bytes contains ZERO byte (uniform
>>         distribution of bytes takes change of absent ZERO byte: (1 - 1 /
>>         256) ^ 1024 = ~1.8%);
>>      3. File is BINARY if first 1024 bytes contains over 85% of characters
>>         not in range 0x07-0x0D, 0x20-0x7F (total we have 153 "binary"
>>         bytes, ~60%).
>>
>>     This algoritm looks like broken.
>>
> Can you suggest a better algoritm?
About false positives:

 1. If a text file is detected as binary:
      * with "svn:auto-props = '*.txt = svn:eol-style=native'" the svn
        client refuses to add the file, because svn:eol-style and
        svn:mime-type=application/octet-stream can't be set
        simultaneously.
        There is a workaround:
          o create an empty file;
          o run svn add on the empty file;
          o replace the empty file with the real data;
          o commit.
      * you can't diff or merge the file ("Cannot display: file marked
        as a binary type.").
        You can't fix this, because you can't remove the svn:mime-type
        property in the last modified revision.
 2. If a binary file is detected as text:
      * svn diff and merge produce unusable output.
        You can fix this in the current revision by setting the
        svn:mime-type property.

I think the false positive where a text file is detected as binary is
the more annoying one.


About file type detection:

 1. The file detection algorithm must be as simple as possible.
 2. If the first N bytes contain a ZERO byte, the file is binary.
 3. If the file is valid UTF-8, the file is text.
 4. If the file contains too many binary characters, the file is binary.
    I think the definitely binary characters are: 0x00-0x08, 0x0B, 0x0C,
    0x0E-0x1F, 0x7F (30 byte values, ~11.7% of the 256 possible).
    These characters are very rarely used in text files. Characters in
    the range 0x80-0xFF can be letters in some encodings.
    The comparison threshold should be significantly lower than the
    share these bytes would have in uniformly distributed data (~11.7%).
    For example, if about 2.5% of the first N bytes are "binary"
    characters, the file is binary.
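
To make steps 2-4 concrete, here is a minimal Python sketch. It is only
an illustration and not Subversion code: the names is_binary and
is_valid_utf8, the 1024-byte sample size and the 2.5% threshold are
assumptions chosen for the example.

```python
import codecs

SAMPLE_SIZE = 1024         # "N" in the text above; value assumed for the example
BINARY_THRESHOLD = 0.025   # the 2.5% example threshold

# Bytes treated as "definitely binary": 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F
BINARY_BYTES = frozenset(
    [*range(0x00, 0x09), 0x0B, 0x0C, *range(0x0E, 0x20), 0x7F]
)


def is_valid_utf8(sample: bytes, complete: bool) -> bool:
    """Return True if sample decodes as UTF-8.

    When the sample is only a prefix of the file (complete=False), a
    multi-byte sequence cut off at the end is not counted as an error.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(sample, final=complete)
    except UnicodeDecodeError:
        return False
    return True


def is_binary(data: bytes,
              sample_size: int = SAMPLE_SIZE,
              threshold: float = BINARY_THRESHOLD) -> bool:
    sample = data[:sample_size]
    if not sample:
        return False                      # empty file: treat as text
    if 0 in sample:
        return True                       # step 2: a ZERO byte means binary
    if is_valid_utf8(sample, complete=len(data) <= sample_size):
        return False                      # step 3: valid UTF-8 means text
    # Step 4: share of "definitely binary" bytes in the sample
    suspicious = sum(1 for b in sample if b in BINARY_BYTES)
    return suspicious / len(sample) >= threshold
```

With these assumptions, text in an 8-bit encoding such as CP1251 fails
the UTF-8 check but contains none of the listed bytes, so it is still
classified as text.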


Overall, the following implementations look workable to me:

 1. Git autodetection: if the first 8000 bytes contain a ZERO byte, the
    file is binary.
    + As simple as possible;
    + Can't detect text files as binary (except text in encodings that
      contain zero bytes, such as UTF-16);
    - Can detect some binary files as text;
 2. Byte range autodetection: if the first N bytes contain any byte in
    the range 0x00-0x08 or 0x0E-0x1F, the file is binary.
    + Still simple;
    - Can detect some short binary files as text;
 3. Byte ratio autodetection: if about P% or more of the first N bytes
    are in 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F, the file is binary.
    - Not so simple;
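
A rough Python sketch of the three variants side by side may help when
comparing them. The function names, the shared 8000-byte limit and the
2.5% default are assumptions for illustration only; none of this is
taken from the Git or Subversion sources.

```python
# Byte set used by variant 3: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F
CONTROL_BYTES = frozenset(
    [*range(0x00, 0x09), 0x0B, 0x0C, *range(0x0E, 0x20), 0x7F]
)


def has_nul(sample: bytes) -> bool:
    """Variant 1: any ZERO byte means binary."""
    return 0 in sample


def has_control_byte(sample: bytes) -> bool:
    """Variant 2: any byte in 0x00-0x08 or 0x0E-0x1F means binary."""
    return any(b <= 0x08 or 0x0E <= b <= 0x1F for b in sample)


def control_share_at_least(sample: bytes, percent: float = 2.5) -> bool:
    """Variant 3: binary if at least `percent`% of bytes are control bytes."""
    if not sample:
        return False
    hits = sum(1 for b in sample if b in CONTROL_BYTES)
    return hits * 100 >= percent * len(sample)


def classify(data: bytes, limit: int = 8000) -> dict:
    """Run all three variants on the first `limit` bytes of data."""
    sample = data[:limit]
    return {
        "nul": has_nul(sample),
        "range": has_control_byte(sample),
        "share": control_share_at_least(sample),
    }
```

For example, classify(b"\x89PNG\r\n\x1a\n") reports binary for variants
2 and 3 but not for variant 1, while 99 ASCII letters followed by a
single ESC byte (0x1B) are reported as binary only by variant 2.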


Best regards,
Navrotskiy Artem.

--------------090109040208050506020809--