Date: Fri, 11 Oct 2013 19:25:19 +0200
From: Stefan Sperling
To: Ben Reser
Cc: Branko Čibej, "users@subversion.apache.org"
Subject: Re: SVN Blame Returns Corrupt Data
Message-ID: <20131011172519.GE24900@ted.stsp.name>
In-Reply-To: <52582CCF.2020802@reser.org>

On Fri, Oct 11, 2013 at 09:52:31AM -0700, Ben Reser wrote:
> On 10/11/13 9:22 AM, Branko Čibej wrote:
> > You'd have to extend Subversion's file type detection to detect UTF-16.
> > See svn_io_detect_mimetype2 in line 3333 in this file:
> >
> > http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_subr/io.c?view=markup
> > Subversion currently only looks at the first 1k Bytes of a file. It may
> > be enough to check that this initial part of the file contains only
> > valid UTF-16 (BE or LE) codes.
>
> Even if all we looked for is the BOM it might be helpful enough. I suspect the
> development tools producing UTF-16 are including BOMs. Windows seems to be
> fond of including them, Notepad puts one even on UTF-8.

Couldn't Subversion automatically convert UTF-16 files to UTF-8 before
processing them for diff/merge/blame, and convert output written to the
original files back to UTF-16?

That would require some work because existing streams, strings, and files
passed around in the code would need to be wrapped so that translation
between the internal and the external encoding is seamless.
But I don't see why such an approach couldn't be made to work in principle.

It might even result in some spring cleaning in the code base and
pave the way for improved handling of file formats such as XML for
diff and merge.

What do you think?
Is it worth adding this to our project ideas page?
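
To make the idea a bit more concrete, here is a rough sketch in Python rather than in Subversion's C. It is not Subversion code: the function names (detect_utf16_bom, to_internal, to_external) are made up for illustration, and it assumes BOM-only detection within the first 1k bytes, as Ben suggested. Files without a UTF-16 BOM pass through untouched.

```python
import codecs

# Hypothetical helpers for illustration only; none of these names exist in Subversion.
# Maps each UTF-16 byte-order mark to the matching BOM-less Python codec name.
_UTF16_BOMS = {
    codecs.BOM_UTF16_LE: "utf-16-le",
    codecs.BOM_UTF16_BE: "utf-16-be",
}


def detect_utf16_bom(head: bytes):
    """Return the UTF-16 codec named by a BOM at the start of `head`, or None."""
    # The UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM, so rule it out first.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return None
    for bom, codec in _UTF16_BOMS.items():
        if head.startswith(bom):
            return codec
    return None


def to_internal(raw: bytes):
    """Translate file contents to UTF-8 for diff/merge/blame.

    Returns (data, codec). `codec` is None when no UTF-16 BOM was found,
    in which case `data` is the unchanged input.
    """
    codec = detect_utf16_bom(raw[:1024])  # only inspect the first 1k bytes
    if codec is None:
        return raw, None
    # Drop the 2-byte BOM; decode() raises UnicodeDecodeError on invalid UTF-16.
    text = raw[2:].decode(codec)
    return text.encode("utf-8"), codec


def to_external(data: bytes, codec):
    """Translate UTF-8 output back to the file's original encoding."""
    if codec is None:
        return data
    bom = codecs.BOM_UTF16_LE if codec == "utf-16-le" else codecs.BOM_UTF16_BE
    return bom + data.decode("utf-8").encode(codec)
```

A caller would run to_internal() on the file contents, hand the UTF-8 bytes to the existing line-based machinery, and pass whatever gets written back through to_external() with the codec remembered from the first step. The stricter check Branko describes (validating that the first 1k bytes contain only valid UTF-16) could replace detect_utf16_bom() without changing the other two functions.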