accumulo-dev mailing list archives

From David Medinets <>
Subject Re: Setting Charset in getBytes() call.
Date Mon, 29 Oct 2012 20:29:04 GMT
Any time I've encountered non-English character sets, the answer
has been to use UTF-8. I'm moving forward with that assumption since
it is a safe change. If the group decides to use a different default
encoding, it will be trivial to build on the work that I've done
identifying getBytes() calls. I will post a list of files and my
methodology before an svn checkin.
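
For illustration, here is a minimal sketch of the kind of change being
described: replacing a bare getBytes() call, which uses the platform
default charset, with one that names UTF-8 explicitly. The class and
method names are hypothetical and not taken from the Accumulo source,
and StandardCharsets assumes Java 7 or later (on older JVMs,
Charset.forName("UTF-8") would be the equivalent).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ExplicitCharsetExample {

    // Encode with an explicit charset instead of the platform default,
    // so the result is the same on every JVM and locale.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicit charset.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"
        byte[] encoded = toUtf8(text);
        // 'é' (U+00E9) takes two bytes in UTF-8, so this prints 5.
        System.out.println(encoded.length);
        // Text that starts life as a String survives the round trip: prints true.
        System.out.println(fromUtf8(encoded).equals(text));
    }
}
```

Note that this only covers the String-to-bytes direction for data that
really is text; it says nothing about arbitrary bytes being decoded as
UTF-8.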

On Mon, Oct 29, 2012 at 4:02 PM, Benson Margulies <> wrote:
> On Mon, Oct 29, 2012 at 3:18 PM, John Vines <> wrote:
>> So perhaps we should have ISO-8859-1 as the standard. Mike, do you see any
>> reason to use something besides ISO-8859-1 for the encodings?
> I object and caution against *any* plan that involves transcoding from
> X to UTF-16 and back when the data is not always going to be
> valid bytes of encoding X. The only clean solution here is to have an
> API entirely in terms of bytes, and either let the user do getBytes if
> they want to store string data, or provide additional API.
>> John
>> On Mon, Oct 29, 2012 at 3:14 PM, Michael Flester <> wrote:
>>> > UTF-8 should always be present (according to the JLS), and as a
>>> > multi-byte format should be able to encode any character that you
>>> > would need to.
>>> >
>>> UTF-8 cannot encode arbitrary data. Not all data that we store in Accumulo
>>> is characters. A safe encoding to use as a pass-through when you
>>> don't know if you are dealing with characters is ISO-8859-1, since we know
>>> that we can make the round trip from bytes to string to bytes without loss.
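
The round-trip claim in the quoted message can be checked with a small
sketch like the one below. It pushes every possible byte value through
bytes -> String -> bytes under each charset and compares the result to
the input. The class and method names are hypothetical, and
StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class RoundTripExample {

    // Returns true if decoding the bytes to a String and encoding them
    // back with the same charset reproduces the original bytes exactly.
    static boolean survivesRoundTrip(byte[] data, Charset charset) {
        byte[] back = new String(data, charset).getBytes(charset);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        // Every byte value from 0x00 to 0xFF, standing in for non-text data.
        byte[] raw = new byte[256];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) i;
        }

        // ISO-8859-1 maps each byte to exactly one char: prints true.
        System.out.println(survivesRoundTrip(raw, StandardCharsets.ISO_8859_1));

        // Bytes 0x80..0xFF in this order are malformed UTF-8 and are
        // replaced with U+FFFD on decode, so the data is lost: prints false.
        System.out.println(survivesRoundTrip(raw, StandardCharsets.UTF_8));
    }
}
```

This is also why an API expressed purely in bytes avoids the question:
no decode step means nothing can be replaced or dropped.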
