subversion-dev mailing list archives

From Barry Scott <ba...@barrys-emacs.org>
Subject Re: Subversion binary file detection is look like broken
Date Fri, 03 Oct 2014 20:30:40 GMT

On 3 Oct 2014, at 13:15, Julian Foad <julianfoad@btopenworld.com> wrote:

> Stefan Sperling wrote:
>> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>>>     The Subversion console client tries to detect binary files with this algorithm:
>>> 
>>>      1. File is NOT BINARY if it contains only the UTF-8 BOM signature (why not
>>>         check whether the first N bytes are valid UTF-8?);
>>>      2. File is BINARY if the first 1024 bytes contain a ZERO byte (with a uniform
>>>         distribution of bytes, the chance of no ZERO byte is (1 - 1 /
>>>         256) ^ 1024 = ~1.8%);
>>>      3. File is BINARY if over 85% of the first 1024 bytes are characters
>>>         not in the ranges 0x07-0x0D, 0x20-0x7F (in total there are 153 "binary"
>>>         byte values, ~60%).
>>> 
>>>     This algorithm looks broken.
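
For reference, the three rules listed above can be sketched in Python roughly as follows. This is only an illustration built from the description in this thread, not the actual C code in libsvn_subr/io.c; the names looks_binary and is_text_byte are made up for the example.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def is_text_byte(b: int) -> bool:
    """True for byte values in the two "text" ranges named in rule 3."""
    return 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F


def looks_binary(data: bytes, threshold: float = 0.85) -> bool:
    """Hypothetical re-statement of the described heuristic.

    Only the first 1024 bytes are inspected.
    """
    block = data[:1024]
    # Rule 1: a file holding nothing but the UTF-8 BOM is not binary.
    if block == UTF8_BOM:
        return False
    # Rule 2: any ZERO byte means binary.
    if b"\x00" in block:
        return True
    if not block:
        return False
    # Rule 3: binary if more than `threshold` of the bytes are outside
    # the text ranges.
    binary_count = sum(1 for b in block if not is_text_byte(b))
    return binary_count / len(block) > threshold


if __name__ == "__main__":
    cyrillic = ("а" * 500).encode("utf-8")   # every byte is "binary"
    no_zero = bytes(range(1, 256)) * 4       # all byte values except ZERO
    print(looks_binary(cyrillic))  # UTF-8 Cyrillic text
    print(looks_binary(no_zero))   # evenly spread bytes without ZERO
```

The figures in the list check out: (255/256)^1024 is about 0.018, and the two ranges cover 7 + 96 = 103 byte values, which leaves 153 "binary" ones (about 60%). That is why evenly spread bytes without a ZERO byte stay well under the 85% threshold.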
> 
> The requirement (3) for >85% non-ASCII* bytes => binary, was a historical accident. The
> original intention was >15% non-ASCII bytes => binary, or in other words >85%
> ASCII bytes => text. Quoting from libsvn_subr/io.c:svn_io_is_binary_data():
> 
>      NOTE:  Originally, I intended to target 85% of the bytes being in
>      the specified ranges, but I flubbed the condition.  At any rate,
>      folks aren't complaining, so I'm not sure that it's worth
>      adjusting this retroactively now.
> 
> Perhaps now is the time to change that to match the original intent.
> 
> * I use the term ASCII loosely to mean "bytes in those two ranges".
> 
> 
>> Can you suggest a better algorithm?
>> 
>>> For example:
>>>      1. File "text.txt" detected as "binary"
>>> This file contains a block of text from the Wikipedia article about Subversion in UTF-8
>>> (https://ru.wikipedia.org/wiki/Subversion) and unfortunately contains too
>>> many Cyrillic characters (one character = 2 "binary" bytes).
>>>      2. File "binary.txt" detected as "text"
>>> It was created by "dd if=/dev/urandom of=binary.txt count=1 bs=2048" and
>>> unfortunately does not contain a ZERO byte in the first 1024 bytes.
> 
> Changing the 85% condition would fix example 2. However it would make example 1 occur
> more often, unless we also make valid UTF-8 be detected as text.
> 
> It does sound like a good idea to make valid UTF-8 be detected as text.
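
To make the combined proposal concrete (the originally intended 15% threshold plus "valid UTF-8 is text"), here is a rough Python sketch. It is an assumption about how the two ideas could fit together, not a patch for svn_io_is_binary_data(); the function names and the order of the checks are my own choice.

```python
import codecs


def is_text_byte(b: int) -> bool:
    """True for byte values in the ranges 0x07-0x0D and 0x20-0x7F."""
    return 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F


def is_valid_utf8_prefix(block: bytes) -> bool:
    """True if `block` is valid UTF-8, allowing a multi-byte sequence
    that is cut off by the end of the block."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False keeps an incomplete trailing sequence buffered
        # instead of raising an error for it.
        decoder.decode(block, final=False)
    except UnicodeDecodeError:
        return False
    return True


def looks_binary(data: bytes, threshold: float = 0.15) -> bool:
    """Hypothetical revised heuristic over the first 1024 bytes."""
    block = data[:1024]
    if not block:
        return False
    # A ZERO byte is checked first, because NUL is itself valid UTF-8.
    if b"\x00" in block:
        return True
    # Valid UTF-8 is treated as text regardless of the byte ratio.
    if is_valid_utf8_prefix(block):
        return False
    # Otherwise apply the originally intended limit: more than 15% of
    # bytes outside the text ranges means binary.
    binary_count = sum(1 for b in block if not is_text_byte(b))
    return binary_count / len(block) > threshold
```

With this ordering the Cyrillic example would come out as text and the /dev/urandom example as binary. One known gap in the sketch: control characters other than NUL are still valid UTF-8, so a file full of them would pass as text.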

If you do look at this you might want to fix the problem of .svg files being classed as binary
whereas they are XML. I'm guessing that the MIME type is used, on the assumption that an
image/* type cannot be text.
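
If that guess is right, one possible rule would be to treat XML-based MIME types as text. A small Python sketch of the idea follows; mime_type_is_text is a made-up name, and the "+xml" suffix rule is my assumption, not something Subversion is known to do.

```python
def mime_type_is_text(mime_type: str) -> bool:
    """Hypothetical check: should this MIME type be treated as text?"""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    essence = mime_type.split(";", 1)[0].strip().lower()
    if essence.startswith("text/"):
        return True
    # Assumption: a "+xml" suffix (as in image/svg+xml) marks an
    # XML-based, and therefore textual, format.
    return essence.endswith("+xml") or essence == "application/xml"


if __name__ == "__main__":
    for name in ("image/svg+xml", "image/png", "text/plain; charset=utf-8"):
        print(name, mime_type_is_text(name))
```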

Barry

