subversion-dev mailing list archives

From Navrotskiy Artem <boz...@ya.ru>
Subject Subversion binary file detection is look like broken
Date Fri, 03 Oct 2014 07:26:32 GMT
<div>Hello,</div><div> </div><div>The Subversion console client
tries to detect binary files with the following algorithm:</div><div><ol><li>File is
NOT BINARY if it contains only the UTF-8 BOM signature (why not check whether the first N bytes are valid
UTF-8?);</li><li>File is BINARY if the first 1024 bytes contain a ZERO byte (for uniformly
distributed bytes, the chance that no ZERO byte appears is (1 - 1 / 256) ^ 1024 = ~1.8%);</li><li>File
is BINARY if more than 85% of the first 1024 bytes are outside the ranges <span style="font-family:monospace;font-size:medium;white-space:pre;">0x07-0x0D,
</span><span style="font-family:monospace;font-size:medium;white-space:pre;">0x20-0x7F
(in total, 153 of the 256 byte values count as "binary", ~60%</span>).</li></ol><div>This
algorithm looks broken.</div><div> </div><div>For example:</div><div><ol><li>File
"text.txt" detected as "binary"<br />It contains a block of text from the Wikipedia article about Subversion in UTF-8
(<a href="https://ru.wikipedia.org/wiki/Subversion">https://ru.wikipedia.org/wiki/Subversion</a>)
and unfortunately contains too many Cyrillic characters (one character is 2 "binary" bytes).</li><li>File
"binary.txt" detected as "text"<br />It was created by "dd if=/dev/urandom of=binary.txt
count=1 bs=2048" and unfortunately does not contain a ZERO byte in its first 1024 bytes.</li></ol></div></div><div> </div><div>-- </div><div>Best
regards,</div><div>Navrotskiy Artem</div><div> </div>
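<div>The three rules and the two failure cases described in this message can be reproduced with a short Python sketch. It follows only the description given here and is not taken from the Subversion source. The names looks_binary, is_text_byte and is_valid_utf8_prefix are hypothetical, the empty-file case is an assumption, and the second sample is a deterministic stand-in for zero-free random data rather than real /dev/urandom output.</div>

```python
import codecs

# Sketch of the heuristic as described in the message (not Subversion's code).
UTF8_BOM = b"\xef\xbb\xbf"
SAMPLE_SIZE = 1024        # number of leading bytes inspected, per the message
BINARY_THRESHOLD = 0.85   # share of "binary" bytes that flags a file as binary


def is_text_byte(b: int) -> bool:
    """True for byte values in 0x07-0x0D or 0x20-0x7F."""
    return 0x07 <= b <= 0x0D or 0x20 <= b <= 0x7F


def looks_binary(data: bytes) -> bool:
    """Apply the three rules to the first SAMPLE_SIZE bytes."""
    block = data[:SAMPLE_SIZE]
    if not block or block == UTF8_BOM:
        # BOM-only file is not binary; treating an empty file the same way
        # is an assumption of this sketch.
        return False
    if 0 in block:
        return True
    binary_count = sum(1 for b in block if not is_text_byte(b))
    return binary_count / len(block) > BINARY_THRESHOLD


def is_valid_utf8_prefix(data: bytes) -> bool:
    """The check the message asks about: is the sampled prefix valid UTF-8?

    final=False tolerates a multi-byte character cut off at the sample edge.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data[:SAMPLE_SIZE], final=False)
    except UnicodeDecodeError:
        return False
    return True


# Arithmetic from the message: count of "binary" byte values and their share.
binary_values = sum(1 for b in range(256) if not is_text_byte(b))
binary_share = binary_values / 256

# Chance that 1024 uniformly random bytes contain no zero byte.
p_no_zero = (1 - 1 / 256) ** 1024

# Case 1: Cyrillic UTF-8 text, where every letter is two bytes >= 0x80.
cyrillic = ("Свободная централизованная система управления версиями " * 40).encode("utf-8")

# Case 2: stand-in for zero-free random data, cycling through bytes 1..255.
zero_free = bytes(range(1, 256)) * 5

print(binary_values, round(binary_share, 3), round(p_no_zero, 4))
print("cyrillic flagged binary:", looks_binary(cyrillic))
print("zero-free noise flagged binary:", looks_binary(zero_free))
print("cyrillic valid UTF-8:", is_valid_utf8_prefix(cyrillic))
print("zero-free noise valid UTF-8:", is_valid_utf8_prefix(zero_free))
```

<div>Under these assumptions the Cyrillic sample crosses the 85% threshold while the zero-free sample stays near 60% and passes as text. A UTF-8 validity check separates the two samples correctly, though it would still need to be combined with the ZERO byte rule, since a zero byte is itself valid UTF-8.</div>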