accumulo-notifications mailing list archives

From "Christopher Tubbs (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-840) Allow String-based getBytes calls to pick Charset ending from JVM setting.
Date Wed, 31 Oct 2012 18:02:12 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488024#comment-13488024 ]

Christopher Tubbs commented on ACCUMULO-840:
--------------------------------------------

There are two issues here. The first is establishing a standard encoding for all Accumulo
internal persistent state/metadata, and the second is how API convenience methods that accept
String or char[] or CharSequence (from here on, I'll refer to these three collectively as
"Strings") should encode their input. I'll deal with the latter first:

API: It is important to note that Accumulo deals only with bytes. That's it. We don't guarantee
a sort order for Strings with arbitrary (or configurable) encoding, though some have asked
for custom comparators to achieve fine-grained control over this. Instead, we only guarantee
a sort order for bytes, sorted numerically byte-by-byte, from most significant to least. It
is important to realize that we only deal with bytes internally, because all of the API decisions
appear to be centered around that idea. This is why you almost always see a Text object, because
it holds an arbitrary byte array. It is true that Text has a constructor that accepts a String,
and it uses a very specific encoding when it does so (UTF-8 only, as per its documentation).
We have copied this behavior in some of our APIs to add convenience methods that accept Strings,
because it's easier than forcing users to write {code:java}new Mutation(new Text("myString".getBytes("UTF8")));{code}
It is so much easier to write {code:java}new Mutation("myString");{code} This does not change
the behavior, though. We still expect convenience methods that accept Strings to behave as
though we had converted a String to UTF8 and passed in the resulting bytes (in a Text object)
to the method.
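
To make the two guarantees above concrete, here is a minimal sketch that uses only the JDK (no Accumulo or Hadoop classes). The class and method names are hypothetical, StandardCharsets assumes Java 7 or later, and treating each byte as unsigned during comparison is an assumption of this sketch rather than something stated above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of the Accumulo API.
final class ByteOrderSketch {

    // What a String-accepting convenience method is expected to do:
    // encode as UTF-8 and hand the resulting bytes on.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Compare two byte arrays byte by byte, starting at the first byte.
    // Each byte is treated as an unsigned value (assumption of this sketch).
    // A shorter array sorts before a longer one that it prefixes.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }
}
```

Under this ordering, "Z" sorts before "a", and any pure-ASCII String sorts before one that starts with a non-ASCII character, because UTF-8 encodes non-ASCII characters using bytes of 0x80 and above.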

API (cont.): Now, it may be the case that the API could benefit from convenience wrappers
that accept Strings with a specific encoding, or we could change the behavior of those we
have to respect the JVM's "file.encoding" property, and simply pre-encode the Strings before
we throw their resulting bytes into a Text object. This may be useful and convenient, but
this is a VERY LIMITED SCOPE, and it's important to realize that any consideration of changes
to the way we encode things should focus on this scope, and not go crazy, changing all instances
of "String-based" uses of ".getBytes()" in the code. Regardless of whether we make such changes,
though, we should update our Javadocs to document the encoding these convenience
methods use. Mutation's Javadoc already does this; I'm not sure about elsewhere.
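
As a rough sketch of the two options mentioned above (a wrapper that takes an explicit encoding, versus one that follows the JVM default), consider the following. Both method names are hypothetical and not part of any existing API. Charset.defaultCharset() is the JDK call for the JVM default, which on JVMs of this era is normally derived from "file.encoding" at startup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of the Accumulo API.
final class ConvenienceEncodingSketch {

    // Option 1: the caller names the encoding explicitly.
    static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    // Option 2: pre-encode with whatever the JVM default charset happens to be.
    static byte[] encodeWithJvmDefault(String s) {
        return s.getBytes(Charset.defaultCharset());
    }
}
```

The same String can yield different bytes depending on the charset: U+00E9 is two bytes in UTF-8 but one byte in ISO-8859-1. That is why option 2 changes what ends up in the Text object whenever the JVM setting changes.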

INTERNAL: The other scope to consider for encoding has to do with our internal storage (metadata
we store in Zookeeper, in the !METADATA table, and other places where Accumulo writes persistent
state). It is imperative that we maintain consistency in the way we interpret our persistent
state. For this scope, we absolutely should stick to an encoding, but it should be hard-coded
(use a constant or a util method, for convenience), and should not respect any user-configurable
setting. This is important, because a user should be able to change their JVM's encoding settings
(for the API scope described above) and it should *NOT* affect our ability to read and understand
data that we've previously written to Zookeeper or !METADATA (or elsewhere).
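
A minimal sketch of the "constant or util method" idea, assuming Java 7 or later for StandardCharsets. The class name InternalEncoding and its members are hypothetical, not existing Accumulo code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical utility for internal persistent state; not existing Accumulo code.
final class InternalEncoding {

    // Hard-coded on purpose: never read from file.encoding or any user setting.
    static final Charset CHARSET = StandardCharsets.UTF_8;

    private InternalEncoding() {
    }

    // Encode a String for storage (e.g. in Zookeeper or the !METADATA table).
    static byte[] encode(String s) {
        return s.getBytes(CHARSET);
    }

    // Decode bytes previously written with encode().
    static String decode(byte[] bytes) {
        return new String(bytes, CHARSET);
    }
}
```

Because neither method consults the JVM default, data written under one "file.encoding" setting decodes identically under another.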

INTERNAL (cont.): For the internal, persistent state's encoding, I'm comfortable assuming
that we're already treating all persistently stored Strings as UTF-8 encoded (because we move
things around in Text objects a lot, and where we don't, we're probably using
ASCII, which can safely be treated as UTF-8). If there are any situations where we are storing
persistent state ambiguously, based on anything other than the hard-coded UTF-8 encoding,
such that a problem could arise if a user changed an OS setting or non-ASCII data
found its way in, we should treat that as a bug.
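
The claim that ASCII can safely be treated as UTF-8 can be checked with a small sketch like the one below. The class and method names are hypothetical. It relies only on the fact that UTF-8 encodes every ASCII character as the same single byte, below 0x80.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not existing Accumulo code.
final class AsciiCheckSketch {

    // True if every byte is in the 7-bit ASCII range. Such data decodes
    // identically as US-ASCII and as UTF-8.
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    // Decode stored bytes as UTF-8, the hard-coded internal encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

A check like isAscii could help locate places where non-ASCII data has found its way into state that was written without an explicit encoding.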

As far as I see it, these are the only two scopes we need to concern ourselves with when considering
encoding.
                
> Allow String-based getBytes calls to pick Charset ending from JVM setting.
> --------------------------------------------------------------------------
>
>                 Key: ACCUMULO-840
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-840
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.5.0
>            Reporter: David Medinets
>            Assignee: David Medinets
>            Priority: Minor
>             Fix For: 1.5.0
>
>
> ACCUMULO-836 changed all String-based getBytes() calls to use the UTF-8 standard. However,
> there is a JVM setting called "jvm.encoding" that should be honored. See http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding
> for a discussion of JAVA_TOOL_OPTIONS which might be relevant to this topic. http://javarevisited.blogspot.com/2012/01/get-set-default-character-encoding.html
> is also a good page to read especially the comment on how character encoding is cached.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
