subversion-dev mailing list archives

From Julian Foad <julianf...@btopenworld.com>
Subject Re: Subversion binary file detection is look like broken
Date Fri, 03 Oct 2014 12:15:19 GMT
Stefan Sperling wrote:
> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>>    The Subversion console client tries to detect binary files with this algorithm:
>> 
>>     1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why not
>>        check whether the first N bytes are correct UTF-8?);
>>     2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a uniform
>>        distribution of bytes, the chance of no ZERO byte is (1 - 1 /
>>        256) ^ 1024 = ~1.8%);
>>     3. File is BINARY if over 85% of the first 1024 bytes are characters
>>        not in the ranges 0x07-0x0D, 0x20-0x7F (in total there are 153 "binary"
>>        byte values, ~60%).
>> 
>>    This algorithm looks broken.

The requirement (3) for >85% non-ASCII* bytes => binary was a historical accident. The
original intention was >15% non-ASCII bytes => binary, or in other words >85% ASCII
bytes => text. Quoting from libsvn_subr/io.c:svn_io_is_binary_data():

     NOTE:  Originally, I intended to target 85% of the bytes being in
     the specified ranges, but I flubbed the condition.  At any rate,
     folks aren't complaining, so I'm not sure that it's worth
     adjusting this retroactively now.

Perhaps now is the time to change that to match the original intent.

* I use the term ASCII loosely to mean "bytes in those two ranges".
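
To make the difference between the two thresholds concrete, here is a minimal
Python sketch of the three rules as described in this thread. It is an
illustration only, not the C code in svn_io_is_binary_data(); the names
looks_binary, TEXT_BYTES and max_non_text are made up for this example, and
treating an empty buffer as text is an assumption.

```python
# Byte values counted as "text": 0x07-0x0D and 0x20-0x7F (103 values).
TEXT_BYTES = frozenset(range(0x07, 0x0E)) | frozenset(range(0x20, 0x80))
UTF8_BOM = b"\xef\xbb\xbf"


def looks_binary(data: bytes, max_non_text: float = 0.85) -> bool:
    """Apply the three rules from the thread to the first 1024 bytes.

    max_non_text=0.85 models the current condition; 0.15 models the
    originally intended one. Both the name and the default are hypothetical.
    """
    head = data[:1024]
    # Rule 1: a buffer holding only the UTF-8 BOM is text.
    if head == UTF8_BOM:
        return False
    # Rule 2: any zero byte means binary.
    if 0 in head:
        return True
    # Assumption: an empty buffer is treated as text.
    if not head:
        return False
    # Rule 3: binary if the share of bytes outside the two ranges
    # exceeds the threshold.
    non_text = sum(1 for b in head if b not in TEXT_BYTES)
    return non_text / len(head) > max_non_text
```

With max_non_text=0.85 a buffer cycling evenly through byte values 1-255 counts
as text, because only about 60% of its bytes fall outside the two ranges; with
max_non_text=0.15 it counts as binary. Mostly-Cyrillic UTF-8 text typically
exceeds both thresholds, so it is classified as binary either way.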


> Can you suggest a better algorithm?
> 
>> For example:
>>     1. File "text.txt":
>> The file contains a text block from Wikipedia about Subversion in UTF-8
>> (https://ru.wikipedia.org/wiki/Subversion) and unfortunately contains too
>> many Cyrillic characters (one character = 2 "binary" bytes).
>>     2. File "binary.txt" detected as "text"
>> It was created by "dd if=/dev/urandom of=binary.txt count=1 bs=2048" and
>> unfortunately does not contain a ZERO byte in the first 1024 bytes.

Changing the 85% condition would fix example 2. However it would make example 1 occur more
often, unless we also make valid UTF-8 be detected as text.

It does sound like a good idea to make valid UTF-8 be detected as text.
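
One possible shape for such a check, sketched in Python rather than in
Subversion's C: decode the first 1024 bytes as UTF-8 and treat a clean decode
as text. The function name is_valid_utf8_prefix and the window parameter are
hypothetical. The sketch assumes that a multi-byte character cut off by the
window boundary should not count as an error, while a sequence truncated at
the real end of the data should.

```python
import codecs


def is_valid_utf8_prefix(data: bytes, window: int = 1024) -> bool:
    """Return True if the first `window` bytes of data decode as UTF-8."""
    head = data[:window]
    decoder = codecs.getincrementaldecoder("utf-8")()
    # final=False lets the decoder buffer an incomplete trailing sequence
    # instead of raising; that is only appropriate when the window, not the
    # end of the data, cut the sequence short.
    whole_input_seen = len(data) <= window
    try:
        decoder.decode(head, final=whole_input_seen)
    except UnicodeDecodeError:
        return False
    return True
```

A zero byte is itself valid UTF-8, so a check like this would presumably run
after the ZERO-byte rule rather than replace it. Random data such as example 2
is very unlikely to pass, since stray bytes in the 0x80-0xFF range rarely form
well-formed sequences.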

- Julian
