From: Julian Foad
Date: Fri, 3 Oct 2014 13:15:19 +0100
Subject: Re: Subversion binary file detection is look like broken
To: Stefan Sperling, Navrotskiy Artem
Cc: "dev@subversion.apache.org"
Message-ID: <1412338519.68585.YahooMailNeo@web87701.mail.ir2.yahoo.com>
In-Reply-To: <20141003113501.GF1256@ted.stsp.name>
References: <1048611412321192@web20g.yandex.ru> <20141003113501.GF1256@ted.stsp.name>

Stefan Sperling wrote:
> On Fri, Oct 03, 2014 at 11:26:32AM +0400, Navrotskiy Artem wrote:
>>    The Subversion console client tries to detect binary files with
>>    this algorithm:
>>
>>     1. File is NOT BINARY if it contains only the UTF-8 BOM signature
>>        (why not check whether the first N bytes are correct UTF-8?);
>>     2. File is BINARY if the first 1024 bytes contain a ZERO byte
>>        (with a uniform distribution of bytes, the chance of no ZERO
>>        byte is (1 - 1 / 256) ^ 1024 = ~1.8%);
>>     3. File is BINARY if over 85% of the first 1024 bytes are not in
>>        the ranges 0x07-0x0D, 0x20-0x7F (in total there are 153
>>        "binary" byte values, ~60%).
>>
>>    This algorithm looks broken.

The requirement (3) for >85% non-ASCII* bytes => binary was a historical
accident. The original intention was >15% non-ASCII bytes => binary, or in
other words >85% ASCII bytes => text. Quoting from
libsvn_subr/io.c:svn_io_is_binary_data():

     NOTE:  Originally, I intended to target 85% of the bytes being in
     the specified ranges, but I flubbed the condition.  At any rate,
     folks aren't complaining, so I'm not sure that it's worth
     adjusting this retroactively now.

Perhaps now is the time to change that to match the original intent.

* I use the term ASCII loosely to mean "bytes in those two ranges".


> Can you suggest a better algorithm?
>
>> For example:
>>     1. File "text.txt":
>> This file contains a block of text from Wikipedia about Subversion in
>> UTF-8 (https://ru.wikipedia.org/wiki/Subversion) and unfortunately
>> contains too many Cyrillic characters (one character = 2 "binary" bytes).
>>     2. File "binary.txt" detected as "text"
>> It was created by "dd if=/dev/urandom of=binary.txt count=1 bs=2048" and
>> unfortunately does not contain a ZERO byte in the first 1024 bytes.

Changing the 85% condition would fix example 2. However, it would make
example 1 occur more often, unless we also make valid UTF-8 be detected as
text.

It does sound like a good idea to make valid UTF-8 be detected as text.

- Julian
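
For illustration, the three rules quoted above and the two changes discussed
(restoring the intended 15% threshold, and treating valid UTF-8 as text) can
be sketched in Python. This is only a rough sketch under stated assumptions,
not the code in libsvn_subr/io.c: the names looks_binary, is_textish and
is_valid_utf8, and the parameters max_binary_fraction and accept_utf8, are
hypothetical, and the caller is assumed to pass the first block of the file
(1024 bytes in the description above).

```python
# Sketch only: approximates the heuristic described in this thread.
# It is not the Subversion implementation; all names are hypothetical.

UTF8_BOM = b"\xef\xbb\xbf"


def is_textish(byte):
    """True for byte values in the two 'text' ranges, 0x07-0x0D and 0x20-0x7F."""
    return 0x07 <= byte <= 0x0D or 0x20 <= byte <= 0x7F


def is_valid_utf8(block):
    """True if the whole block decodes as UTF-8.

    A block cut from the start of a longer file may end in the middle of a
    multi-byte character; this sketch does not handle that case.
    """
    try:
        block.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def looks_binary(block, max_binary_fraction=0.85, accept_utf8=False):
    """Classify the first block of a file as binary (True) or text (False)."""
    if block == UTF8_BOM or not block:
        return False                 # rule 1: a lone UTF-8 BOM is text
    if 0 in block:
        return True                  # rule 2: any zero byte means binary
    if accept_utf8 and is_valid_utf8(block):
        return False                 # proposed extension: valid UTF-8 is text
    binary_count = sum(1 for b in block if not is_textish(b))
    # rule 3: too many bytes outside the two text ranges means binary
    return binary_count / len(block) > max_binary_fraction


# Every non-zero byte value, repeated: about 60% "binary" bytes, no zero byte.
no_zero_block = bytes(range(1, 256)) * 4
# Cyrillic text in UTF-8: two "binary" bytes per letter.
cyrillic_block = ("Система управления версиями " * 10).encode("utf-8")

print(looks_binary(no_zero_block))                             # 85% threshold
print(looks_binary(no_zero_block, max_binary_fraction=0.15))   # 15% threshold
print(looks_binary(cyrillic_block))                            # 85% threshold
print(looks_binary(cyrillic_block, 0.15, accept_utf8=True))    # both changes
```

With the 85% threshold the zero-free block passes as text and the Cyrillic
block is flagged as binary, which matches the two examples above. Lowering
the threshold to 15% flags the zero-free block, and the Cyrillic block is
only accepted as text once the UTF-8 check is enabled as well.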