From: "Artem V. Navrotskiy"
Date: Sat, 04 Oct 2014 10:03:46 +0400
To: dev@subversion.apache.org
Subject: Re: Subversion binary file detection is look like broken
Message-ID: <542F8DC2.3000000@yandex.ru>
In-Reply-To: <20141003113501.GF1256@ted.stsp.name>

Hello,

On 03.10.2014 15:35, Stefan Sperling wrote:
> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>> Hello,
>>
>> The Subversion command-line client tries to detect binary files with
>> the following algorithm:
>>
>>  1. A file is NOT BINARY if it contains only the UTF-8 BOM signature
>>     (why not check whether the first N bytes are valid UTF-8?);
>>  2. A file is BINARY if its first 1024 bytes contain a ZERO byte (with
>>     a uniform distribution of bytes, the chance that no ZERO byte
>>     appears is (1 - 1/256)^1024 = ~1.8%);
>>  3. A file is BINARY if over 85% of its first 1024 bytes are
>>     characters outside the ranges 0x07-0x0D and 0x20-0x7F (in total
>>     there are 153 such "binary" byte values, ~60%).
>>
>> This algorithm looks broken.
>>
> Can you suggest a better algorithm?
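
As a quick sanity check of the two figures quoted above, here is a small Python sketch. It only redoes the arithmetic from the quoted message; the variable names are mine and not taken from the Subversion sources.

```python
# Chance that 1024 uniformly random bytes contain no zero byte at all.
p_no_zero = (1 - 1 / 256) ** 1024

# Byte values the quoted heuristic treats as text: 0x07-0x0D and 0x20-0x7F.
text_bytes = set(range(0x07, 0x0E)) | set(range(0x20, 0x80))

# Everything else counts as a "binary" byte value.
binary_count = 256 - len(text_bytes)

print(f"no zero byte in 1024 random bytes: {p_no_zero:.1%}")
print(f"'binary' byte values: {binary_count} of 256 ({binary_count / 256:.0%})")
```
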
About false positives:

  1. If a text file is detected as binary:
    • with "svn:auto-props = '*.txt = svn:eol-style=native'", the svn client refuses to add the file, because svn:eol-style and svn:mime-type=application/octet-stream cannot be set on the same file;
      There is a workaround:
      • create an empty file;
      • run svn add on the empty file;
      • replace the empty file with the real data;
      • commit.
    • you can't diff or merge the file ("Cannot display: file marked as a binary type.").
      You can't fix this retroactively, because you can't remove the svn:mime-type property from the last modified revision.
  2. If a binary file is detected as text:
    • svn diff and merge produce unusable output.
      You can fix this in the current revision by setting the svn:mime-type property.
I think the first kind of false positive, a text file detected as binary, is the more annoying one.

 
About file type detection:
  1. The detection algorithm must be as simple as possible.
  2. If the first N bytes contain a ZERO byte, the file is binary.
  3. If the file is valid UTF-8, the file is text.
  4. If the file contains too many binary characters, the file is binary.
    I think the definitely binary characters are: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F (30 byte values, ~11.7% of all 256).
    These characters are very rarely used in text files. Bytes in the range 0x80-0xFF can be letters in some encodings, so they should not count as binary.
    The comparison threshold should be significantly lower than the share these bytes would have in uniformly distributed data.
    For example, if about 2.5% of the first N bytes are "binary" characters, the file is binary.
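
The four rules above could be sketched roughly as follows. This is only an illustration, not a patch: the function name `looks_binary`, the 1024-byte sample size and the choice to treat an empty file as text are my assumptions, and the 2.5% threshold is just the example value from the text.

```python
import codecs

# Byte values proposed above as "definitely binary".
SUSPICIOUS = frozenset(
    list(range(0x00, 0x09)) + [0x0B, 0x0C] + list(range(0x0E, 0x20)) + [0x7F]
)


def looks_binary(data: bytes, sample_size: int = 1024, threshold: float = 0.025) -> bool:
    """Return True if the leading bytes of `data` look binary (hypothetical helper)."""
    head = data[:sample_size]
    if not head:
        return False  # assumption: an empty file is treated as text

    # Rule 2: any ZERO byte in the sample means binary.
    if b"\x00" in head:
        return True

    # Rule 3: valid UTF-8 means text. An incremental decoder with final=False
    # tolerates a multi-byte sequence cut off at the end of the sample.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
        return False
    except UnicodeDecodeError:
        pass

    # Rule 4: too many "definitely binary" bytes means binary.
    suspicious = sum(1 for b in head if b in SUSPICIOUS)
    return suspicious / len(head) > threshold


print(looks_binary(b"plain ASCII text\n"))
print(looks_binary(b"\xff\x01\x02\x03" * 10))
```

Note that with this ordering rule 4 only ever applies to data that is not valid UTF-8, for example text in a legacy 8-bit encoding.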

Overall, the following implementations look workable to me:

  1. Git autodetection: if the first 8000 bytes contain a ZERO byte, the file is binary.
    + As simple as possible;
    + Can't detect text files as binary;
    - Can detect some binary files as text;
  2. Byte range autodetection: if the first N bytes contain any byte in the range 0x00-0x08 or 0x0E-0x1F, the file is binary.
    + Still simple;
    - Can detect some short binary files as text;
  3. Byte share autodetection: if more than a threshold percentage of the first N bytes are in 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F, 0x7F, the file is binary.
    - Not so simple;
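
To make the trade-offs between the three variants concrete, here is a small Python sketch that runs them side by side. The function names, N = 1024 for variants 2 and 3, and the 2.5% threshold for variant 3 are assumptions for illustration only; the 8000-byte window is the one mentioned above for Git.

```python
# Variant 2 byte set: control bytes other than tab, LF, VT, FF and CR.
CONTROL = frozenset(range(0x00, 0x09)) | frozenset(range(0x0E, 0x20))
# Variant 3 byte set: the same plus VT, FF and DEL.
SUSPICIOUS = CONTROL | {0x0B, 0x0C, 0x7F}


def git_style(data: bytes, n: int = 8000) -> bool:
    """Variant 1: binary if the first n bytes contain a ZERO byte."""
    return b"\x00" in data[:n]


def any_control(data: bytes, n: int = 1024) -> bool:
    """Variant 2: binary if the first n bytes contain any CONTROL byte."""
    return any(b in CONTROL for b in data[:n])


def control_share(data: bytes, n: int = 1024, threshold: float = 0.025) -> bool:
    """Variant 3: binary if the share of SUSPICIOUS bytes exceeds threshold."""
    head = data[:n]
    if not head:
        return False  # assumption: an empty file is treated as text
    return sum(b in SUSPICIOUS for b in head) / len(head) > threshold


samples = {
    "plain text": b"plain text\n",
    # Only the 8-byte PNG signature, which happens to contain no ZERO byte.
    "PNG signature": bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]),
    # Text with a single form feed (0x0C) among 100 bytes.
    "text with form feed": b"a" * 99 + b"\x0c",
}

for name, data in samples.items():
    print(name, git_style(data), any_control(data), control_share(data))
```

The PNG signature sample shows the weakness listed for variant 1: a short binary sample without a ZERO byte passes as text there, while the 0x1A byte is enough for variants 2 and 3 to flag it.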


Best regards,
Navrotskiy Artem.