Date: Tue, 30 Oct 2012 20:31:51 -0400
Subject: Re: Setting Charset in getBytes() call.
From: Benson Margulies
To: dev@accumulo.apache.org

On Tue, Oct 30, 2012 at 8:21 PM, Josh Elser wrote:
> On 10/30/2012 7:47 PM, David Medinets wrote:
>>>
>>> My issue with this is that you have now hard-coded the fact that everyone
>>> else is going to use UTF-8.
>>
>>
>> Who is everyone else? I agree that I have hard-coded the use of UTF-8.
>> On the other hand, I've merely codified an existing practice. Thus the
>> issue is now exposed, the places the convention is used are defined.
>> Once a consensus is reached, we can implement it with confidence.
>
>
> "Everyone else" is everyone who builds Accumulo since you committed your
> changes and uses it. Ignoring that, forcing a single charset isn't the big
> issue here (as we've *all* agreed that UTF-8 should not cause any
> data-correctness issues) so for now I'll just drop it as it's just creating
> confusion.
>
> My issue is *how* you implemented the default charset. We already have 3
> people (Marc, Bill and myself) who have stated that we believe inline
> charset declaration is not the correct implementation and that using the JVM
> property is the better implementation.
>
> I'd encourage others to weigh in to form a complete consensus and shift the
> discussion to that implementation if needed.
>
>>
>>> way to fix the problem. I still contest that setting the desired encoding
>>> (via the appropriate JVM property like Bill Slacum initially suggested) is
>>> the
>>> proper way to address the issue.
>>
>>
>> It is easy to do both. Create a ByteEncodingInitializer (or somesuch)
>> class that reads the JVM property and defines a globally used Charset.
>> Then find those utf8 definitions and usages and replace them with the
>> globally-defined value.
>
>
> Again, by setting the 'file.encoding' JVM parameter, such a class is
> unnecessary because it should be handled internal to Java. For Oracle/Sun
> JDK and OpenJDK, setting the "file.encoding" parameter at run time will use
> the provided charset you wanted without actually changing any code.

If Accumulo were only a pile of servers, you could do this. You could say
that part of the configuration process for the servers is to pass the
desired encoding as file.encoding, and your shell scripts could set UTF-8
by default.

But Accumulo is *not* just a pile of servers. Setting file.encoding affects
the entire JVM. A webapp that uses Accumulo would need the entire servlet
container to run with a particular setting of file.encoding. This just does
not work in the wild.

Even without the servlet container issue, a user of Accumulo may need to
plug it into an existing code base that has other reasons to set
file.encoding, and will not like it when Accumulo starts to corrupt his or
her string data.
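
To make the difference concrete, here is a minimal sketch, assuming Java 7 or later for StandardCharsets. It is not Accumulo code: the class CharsetDemo and its methods are hypothetical names chosen for illustration. Rather than changing file.encoding, the sketch passes ISO-8859-1 explicitly to stand in for a JVM whose default charset is not UTF-8, which is the situation a library embedded in someone else's application cannot rule out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in Accumulo.
public class CharsetDemo {

    // Explicit charset: the bytes do not depend on the JVM-wide default.
    public static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for the no-argument getBytes(): the caller passes the charset
    // that the host JVM's default would have been.
    public static byte[] encodeWithDefault(String s, Charset jvmDefault) {
        return s.getBytes(jvmDefault);
    }

    // A reader that assumes the stored bytes are UTF-8.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // contains one non-ASCII character

        byte[] explicit = encodeUtf8(value);
        byte[] hostDefault = encodeWithDefault(value, StandardCharsets.ISO_8859_1);

        System.out.println("JVM default charset here: " + Charset.defaultCharset());
        System.out.println("explicit UTF-8 length:    " + explicit.length);
        System.out.println("ISO-8859-1 length:        " + hostDefault.length);
        System.out.println("UTF-8 round trip intact:  "
                + value.equals(decodeUtf8(explicit)));
        System.out.println("mismatched decode intact: "
                + value.equals(decodeUtf8(hostDefault)));
    }
}
```

The two encodings produce different byte sequences for the same string, and decoding the ISO-8859-1 bytes as UTF-8 does not give back the original value. With an explicit charset on both sides, the result is the same regardless of how the surrounding JVM is configured.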