accumulo-dev mailing list archives

From Mike Drob <md...@mdrob.com>
Subject Re: Setting Charset in getBytes() call.
Date Mon, 29 Oct 2012 17:16:48 GMT
One specific use case is creating a new connection: the password is
passed as a byte[], but I expect most sane applications will treat it as a
String (read either from a file or from terminal input). If somebody
creates the password under a different platform encoding than the one the
programmer expects, the result is a lockout that is very difficult to
debug.
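
To make the lockout scenario concrete, here is a minimal sketch. The class
name, the helper method, and the sample password are made up for
illustration and are not Accumulo code. It only shows that the same String
turns into different byte[] values depending on the charset used to encode
it, which is what would make a byte-wise password comparison fail.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class PasswordEncodingDemo {
    // Hypothetical helper: encode a password with an explicit charset
    // instead of relying on the platform default.
    public static byte[] encode(String password, Charset charset) {
        return password.getBytes(charset);
    }

    public static void main(String[] args) {
        // Sample password containing one non-ASCII character (a-umlaut),
        // written as a Unicode escape so the source file encoding is irrelevant.
        String password = "p\u00e4ssword";

        byte[] asUtf8 = encode(password, Charset.forName("UTF-8"));
        byte[] asLatin1 = encode(password, Charset.forName("ISO-8859-1"));

        // The two arrays differ in length and content, so a byte-wise
        // comparison of the "same" password does not match.
        System.out.println("UTF-8 length:      " + asUtf8.length);
        System.out.println("ISO-8859-1 length: " + asLatin1.length);
        System.out.println("Equal bytes?       " + Arrays.equals(asUtf8, asLatin1));
    }
}
```

A password made up only of ASCII characters would encode identically in
both charsets, which is part of why this kind of problem can go unnoticed
for a long time.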

On topic to the original question, if anybody is brave enough to use Java
7, then there are predefined constants in the JDK -
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html

UTF-8 should always be present (the java.nio.charset.Charset documentation
requires every Java platform implementation to support it), and as a
variable-width encoding it can represent any Unicode character that you
would need. I've had this conversation with Keith before, so hopefully he
can weigh in on this.
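
As a rough sketch of what using the Java 7 constant could look like (the
class and method names are hypothetical, and this assumes Java 7 or later
is available), the snippet below encodes and decodes a String with
StandardCharsets.UTF_8, including a character outside the Basic
Multilingual Plane. It assumes well-formed input; a String containing an
unpaired surrogate would not survive the round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Hypothetical helper: encode to UTF-8 bytes and decode back again.
    public static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Latin, CJK, and a supplementary (non-BMP) character, written as
        // Unicode escapes so the source file encoding does not matter.
        String sample = "caf\u00e9 \u4e2d\u6587 \uD83D\uDE00";

        System.out.println("Round trip preserved the text: "
                + sample.equals(roundTrip(sample)));

        // The constant denotes the same charset that Charset.forName returns.
        System.out.println("Same charset as forName: "
                + StandardCharsets.UTF_8.equals(Charset.forName("UTF-8")));
    }
}
```

On Java 6, Charset.forName("UTF-8") stored in a static final field would
serve the same purpose as the constant.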

Mike



On Mon, Oct 29, 2012 at 12:57 PM, Josh Elser <josh.elser@gmail.com> wrote:

> Benson, perhaps "contrived" would have been better than "hypothetical" :).
> That being said, I also hadn't thought about other JVM implementations.
>
> I wonder if leaving a commented note in the accumulo-env.sh script listing
> alternative names for the "file.encoding" property, and the JVM each one
> applies to, would be sufficient?
>
> David, can you give some sort of feel for the usages of the getBytes()
> calls? Since most of the API deals with things in terms of Text and byte[]
> (Key and Value decomposed), are most of the usages configuration/user-input
> based as your initial snippet from InputFormatBase showed?
>
>
> On 10/29/2012 12:42 PM, John Vines wrote:
>
>> Are there any experts when it comes to character encodings? First of all,
>> I would like to make sure there are no sacrifices being made by forcing
>> UTF-8.
>>
>> From there, I think JVM properties are the way to go. Should there be ANY
>> sort of shortfall with UTF-8, we should allow users to switch the encoding
>> to the type of their pleasure. We can tweak the scripts to set the JVM
>> property but still allow users to override it should they need to in their
>> setup. This not only lets us avoid a massive code change, it also
>> makes it easier for users to switch to another encoding should they need
>> to.
>>
>> John
>>
>> On Mon, Oct 29, 2012 at 12:24 PM, Benson Margulies <bimargulies@gmail.com> wrote:
>>
>>> On Mon, Oct 29, 2012 at 12:21 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>>
>>>> David, I beg to differ.
>>>>
>>>> Setting it via the JVM property is a single change to make, whereas if you
>>>> change every single usage of getBytes(), you now force the next person to
>>>> branch the code, change everything to UTF16 (hypothetical use case) and
>>>> continue a diverged codebase forever.
>>>>
>>>
>>> Typically, the reason(s) that people don't take this approach are:
>>>
>>> a: a fear that other JVMs don't have this parameter, or don't have it
>>> under the same name.
>>> b: a desire to read or write files for uses in 'the platform encoding'
>>> whatever it is, in addition to whatever needs to be done in UTF-8.
>>>
>>> I'd be very surprised if Accumulo ever decided to do this sort of
>>> thing in UTF-16.
>>>
>>>
>>>
>>>> I would say that the reason that such a JVM property exists is to spare
>>>> you from having to make these code changes in the first place.
>>>>
>>>> On 10/29/2012 12:00 PM, David Medinets wrote:
>>>>
>>>>>
>>>>> I like the idea of making the change explicit in the source code.
>>>>> Setting the encoding in the jvm property would be easier but not as
>>>>> explicit. I have a few dozen of the files changed. Today I have free
>>>>> time since Hurricane Sandy has closed offices.
>>>>>
>>>>> On Mon, Oct 29, 2012 at 11:39 AM, William Slacum
>>>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>>>
>>>>>>
>>>>>> Isn't it easier to just set the JVM property `file.encoding`?
>>>>>>
>>>>>> On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey <ekohlwey@gmail.com> wrote:
>>>>>>
>>>>>>> If you use a private static field in each class for the charset, it will
>>>>>>> basically be a singleton because charsets are cached in
>>>>>>> Charset.forName.
>>>>>>> IMHO this is a somewhat cleaner approach than having lots of static
>>>>>>> imports to utility classes with lots of constants in them.
>>>>>>>
>>>>>>> On Oct 28, 2012 5:50 PM, "David Medinets" <david.medinets@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
>>>>>>>>
>>>>>>>> In this comment, John mentioned that all getBytes() method calls
>>>>>>>> should be changed to use UTF8. There are about 1,800 getBytes()
>>>>>>>> calls and not all of them involve String objects. I am working on
>>>>>>>> ways to identify a subset of these calls to change.
>>>>>>>>
>>>>>>>> I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
>>>>>>>> track this issue.
>>>>>>>>
>>>>>>>> Should we create one static Charset object?
>>>>>>>>
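>>>>>>>>     import java.nio.charset.Charset;
>>>>>>>>
>>>>>>>>     public class AccumuloDefaultCharset {
>>>>>>>>       // Shared constant; "UTF-8" is the canonical charset name.
>>>>>>>>       public static final Charset UTF8 = Charset.forName("UTF-8");
>>>>>>>>     }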
>>>>>>>>
>>>>>>>> Should we use a static constant?
>>>>>>>>
>>>>>>>>     public static final String UTF8 = "UTF-8";
>>>>>>>>
>>>>>>>> I have found one instance of getBytes() in InputFormatBase:
>>>>>>>>
>>>>>>>>     protected static byte[] getPassword(Configuration conf) {
>>>>>>>>       return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
>>>>>>>>     }
>>>>>>>>
>>>>>>>> Are there any reasons why I can't start specifying the charset? Is
>>>>>>>> UTF8 the right Charset to use? I am not an expert in non-English
>>>>>>>> charsets, so guidance would be welcome.
