accumulo-dev mailing list archives

From Marc Parisi <>
Subject Re: Setting Charset in getBytes() call.
Date Fri, 02 Nov 2012 12:24:43 GMT
John, that would lead us to a configuration management issue. To keep
configuration files in line would be the same as ensuring file.encoding is
the same across the platform.

The JLS doesn't specify a Charset encoding scheme; however, for quite some
time the file.encoding fall-through (that is, the value used when it's not
specified) has been UTF-8. That default could change and is not backed by the
JLS, whereas file.encoding is. It's a fall-through, meant to take care of
configuration mismanagement.

Further, these changes will have issues if you specify a file.encoding in
your configuration, as you don't always enforce UTF-8 in every String
instance, especially in some of your aggregator changes.
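
To make the concern concrete, here is a minimal sketch (the class and method
names are made up for illustration, not taken from the Accumulo code base) of
what happens when one code path encodes with one charset and another decodes
with a different one:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Encodes with the JVM default charset, which is normally derived from
    // file.encoding at startup. The result depends on how the JVM was launched.
    static byte[] encodeWithDefault(String s) {
        return s.getBytes(Charset.defaultCharset());
    }

    // Simulates a writer and a reader that may disagree about the charset.
    static String roundTrip(String s, Charset writer, Charset reader) {
        return new String(s.getBytes(writer), reader);
    }

    public static void main(String[] args) {
        String value = "caf\u00e9";
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("byte count with default: " + encodeWithDefault(value).length);

        // Matching charsets round-trip cleanly.
        System.out.println(value.equals(
            roundTrip(value, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));

        // Mismatched charsets do not: the two UTF-8 bytes of the accented
        // character are decoded as two separate ISO-8859-1 characters.
        System.out.println(value.equals(
            roundTrip(value, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)));
    }
}
```

The mismatch case is the one to watch for when some String instances are
pinned to UTF-8 and others follow file.encoding.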

"If Accumulo was only a pile of servers, you could do this. You could
say that part of the configuration process for the servers is to
specify the desired encoding to file.encoding, and your shell scripts
could set UTF-8 by default.

But Accumulo is *not* just a pile of servers. Setting file.encoding
affects the entire JVM. A webapp that uses Accumulo now would need to
have the entire servlet container have a particular setting of
file.encoding. This just does not work in the wild. Even without the
servlet container issue, a user of Accumulo may need to plug it into
an existing code base that has other reasons to set file.encoding, and
will not like it when Accumulo starts to corrupt his or her string

I gather that what you mean is that multiple, transient, execution paths
within the tserver should support multiple encodings; however, setting
file.encoding ensures that the platform, which is encompassed in a JVM,
encodes and decodes values in an understood way (that's what character set
encodings are meant to enforce). If a user wishes to have his or her own
execution path (or their own encoding for an iterator), then he or she would
likely define this. The fact that we require configuration parameters for
the bulk of these changes in core is an indication that the core API
contains features that are seeping into user functionality. Keep the
encoding/decoding in client code, not within the tserver process. Use
file.encoding for the core project and require that clients do their own
encoding/decoding; our changeset would then be much smaller.
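
As a rough sketch of that separation (ByteStore and StringClient are
hypothetical stand-ins I made up here, not Accumulo classes), the server side
would only ever see bytes, and the client would own the charset decision:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ClientSideEncoding {
    // Hypothetical stand-in for the server side: it stores and returns raw
    // bytes and never converts them to or from String.
    static class ByteStore {
        private final Map<ByteBuffer, byte[]> cells = new HashMap<>();

        void put(byte[] key, byte[] value) {
            cells.put(ByteBuffer.wrap(key.clone()), value.clone());
        }

        byte[] get(byte[] key) {
            byte[] value = cells.get(ByteBuffer.wrap(key));
            return value == null ? null : value.clone();
        }
    }

    // Hypothetical client: it picks a charset for its own data and applies
    // it on both the write path and the read path.
    static class StringClient {
        private final ByteStore store;
        private final Charset charset;

        StringClient(ByteStore store, Charset charset) {
            this.store = store;
            this.charset = charset;
        }

        void write(String key, String value) {
            store.put(key.getBytes(charset), value.getBytes(charset));
        }

        String read(String key) {
            byte[] raw = store.get(key.getBytes(charset));
            return raw == null ? null : new String(raw, charset);
        }
    }

    public static void main(String[] args) {
        ByteStore store = new ByteStore();
        StringClient client = new StringClient(store, StandardCharsets.UTF_16BE);
        client.write("greeting", "caf\u00e9");
        System.out.println(client.read("greeting"));
    }
}
```

Under this arrangement the store needs no charset configuration at all; two
clients with different encodings can share it as long as each reads back only
what it wrote.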

A webapp is a fantastic example; however, let's take it a step further.
Accumulo is JBoss. The iterator/client code is the webapp. We should
separate Accumulo from client and client iterator code to avoid these
design issues and place the onus on the user, not Accumulo. In all honesty,
and I may be off base here, in the case of iterators we should move
them to a different package and, if so desired, add options to the
iterators, but there is no need to default to UTF-8. It's been that way for
some time.
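
For what an iterator option might look like, here is a minimal sketch. The
class, the "encoding" option key, and the init signature are all assumptions
made for illustration; this is not the existing iterator API. With no option
set, it stays on the JVM default rather than forcing UTF-8:

```java
import java.nio.charset.Charset;
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

public class IteratorCharsetOption {
    // Hypothetical option key, chosen for this sketch only.
    static final String ENCODING_OPTION = "encoding";

    // Starts at the JVM default, which is normally derived from file.encoding.
    private Charset charset = Charset.defaultCharset();

    // Reads the charset from user-supplied options. With no option set, the
    // JVM default is kept.
    void init(Map<String, String> options) {
        String name = options.get(ENCODING_OPTION);
        if (name != null) {
            charset = Charset.forName(name);
        }
    }

    Charset getCharset() {
        return charset;
    }

    // Example of string work inside the iterator: decode, transform, and
    // re-encode, all with the charset the user selected.
    byte[] upperCase(byte[] value) {
        String text = new String(value, charset);
        return text.toUpperCase(Locale.ROOT).getBytes(charset);
    }

    public static void main(String[] args) {
        IteratorCharsetOption it = new IteratorCharsetOption();
        it.init(Collections.singletonMap(ENCODING_OPTION, "ISO-8859-1"));
        System.out.println("iterator charset: " + it.getCharset());
    }
}
```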

On Wed, Oct 31, 2012 at 2:02 PM, Christopher Tubbs <> wrote:

> I've added my own comments to this thread on the ACCUMULO-840 ticket.
> --
> Christopher L Tubbs II
> On Tue, Oct 30, 2012 at 10:35 PM, John Vines <> wrote:
> > Why not just have a configuration in the xml file for setting a global
> > charset? This way we avoid hard-coded settings but also avoid the
> > shared VM issues.
> >
> > John
> >
> > Sent from my phone, pardon the typos and brevity.
> > On Oct 30, 2012 10:29 PM, "David Medinets" <> wrote:
> >
> >> Re-reading and re-thinking I can see your point about how, by
> >> specifying UTF-8, Accumulo is now flouting the file.encoding
> >> parameter. I'd like to implement a static method inside
> >> core/src/main/java/org/apache/accumulo/core/util/ Then
> >> do something like getBytes(Encoding.getCharset()) instead of
> >> hard-coding UTF-8.
> >>
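> >> import java.nio.charset.Charset;
> >>
> >> class Encoding {
> >>   // Resolved lazily on first use, so the field cannot be final.
> >>   private static Charset charset = null;
> >>
> >>   // Static so callers can write getBytes(Encoding.getCharset()).
> >>   public static synchronized Charset getCharset() {
> >>     if (charset == null) {
> >>       // Honor file.encoding; fall back to UTF-8 only if it is unset.
> >>       charset = Charset.forName(System.getProperty("file.encoding", "UTF-8"));
> >>     }
> >>     return charset;
> >>   }
> >>   ...
> >> }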
> >>
