accumulo-dev mailing list archives

From Drew Farris <>
Subject Re: Setting Charset in getBytes() call.
Date Tue, 30 Oct 2012 01:22:01 GMT
I have always wondered whether there are cases in the API where users are
forced to use Text when they would otherwise prefer byte[], e.g., stuffing a
non-UTF-8 byte[] into a Text object to facilitate storage or sorting. I am
not entirely sure whether Text would complain in that case. I suspect
we should seek to eliminate these cases if they currently exist.
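
To make the concern concrete, here is a minimal sketch using only the JDK. It does not exercise Hadoop's Text at all, so it says nothing about whether Text itself complains; the class name Utf8Check and both helper methods are made up for illustration. It shows how arbitrary bytes can be tested for UTF-8 well-formedness, and that a decode/encode trip through String does not give the original bytes back when they are not valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Accumulo or Hadoop.
final class Utf8Check {

    // Returns true only if the bytes are a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes as UTF-8 and re-encodes. new String(byte[], Charset) replaces
    // malformed input with U+FFFD, so non-UTF-8 bytes do not survive.
    static boolean survivesStringRoundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] text = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x01};

        System.out.println(isValidUtf8(text));               // true
        System.out.println(isValidUtf8(binary));             // false
        System.out.println(survivesStringRoundTrip(text));   // true
        System.out.println(survivesStringRoundTrip(binary)); // false
    }
}
```

The second check is the one that matters for storage: any code path that silently turns user bytes into a String and back can change them.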

Speaking strictly of user data, I agree that, fundamentally, every operation
should be based on byte[]. API methods that provide Text- and String-based
calls should be convenience methods where the conversion of text to/from
bytes is handled explicitly (not relying on the platform default encoding or
properties) and transparently (doing something sensible when the user
doesn't care or is unaware of the issues surrounding character encoding).
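
A rough sketch of what such an API shape could look like, using only the JDK. The class Cell and its methods are hypothetical and are not Accumulo classes; the choice of UTF-8 as the "sensible" default is my assumption. The byte[] form is the primary one, and the String overloads never fall back to the platform default encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical value holder for illustration; not an Accumulo class.
final class Cell {
    private final byte[] value;

    // Primary form: raw bytes, copied defensively and never interpreted.
    Cell(byte[] value) {
        this.value = value.clone();
    }

    // Convenience for callers who don't care: the encoding is fixed to UTF-8
    // (an assumed default), not taken from the platform.
    Cell(String value) {
        this(value, StandardCharsets.UTF_8);
    }

    // Convenience for callers who do care: the encoding is stated explicitly.
    Cell(String value, Charset charset) {
        this(value.getBytes(charset));
    }

    byte[] getBytes() {
        return value.clone();
    }

    String getString() {
        return getString(StandardCharsets.UTF_8);
    }

    String getString(Charset charset) {
        return new String(value, charset);
    }

    public static void main(String[] args) {
        Cell utf8 = new Cell("\u00e9");                                // U+00E9
        Cell latin1 = new Cell("\u00e9", StandardCharsets.ISO_8859_1);
        System.out.println(utf8.getBytes().length);   // 2 (0xC3 0xA9)
        System.out.println(latin1.getBytes().length); // 1 (0xE9)
    }
}
```

The point of the design is that String.getBytes() with no argument, which does depend on the platform default, never appears anywhere in the conversion path.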

Regarding UTF-8, is there a need to support arbitrary character encodings
when persisting bytes to Accumulo? Think of byte order for lexical sorting,
fixed vs. variable length, etc. Perhaps it would not be unreasonable to
support explicitly stating a character encoding at table creation?
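
To illustrate the sorting point, here is a small JDK-only sketch. It assumes keys are ordered by unsigned, byte-wise lexicographic comparison; the class name EncodingOrder and the helper compareBytes are hypothetical. The same two characters sort one way when encoded as UTF-8 and the other way when encoded as UTF-16BE, and their encoded lengths differ as well.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not part of Accumulo.
final class EncodingOrder {

    // Lexicographic comparison that treats each byte as unsigned (0..255).
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                                 // U+FF5E
        String supplementary = new String(Character.toChars(0x10000)); // U+10000

        // UTF-8: EF BD 9E vs F0 90 80 80, so U+FF5E sorts first,
        // which agrees with code point order.
        int utf8 = compareBytes(bmp.getBytes(StandardCharsets.UTF_8),
                supplementary.getBytes(StandardCharsets.UTF_8));

        // UTF-16BE: FF 5E vs D8 00 DC 00, so U+10000 sorts first,
        // which disagrees with code point order.
        int utf16 = compareBytes(bmp.getBytes(StandardCharsets.UTF_16BE),
                supplementary.getBytes(StandardCharsets.UTF_16BE));

        System.out.println(utf8 < 0);  // true
        System.out.println(utf16 > 0); // true

        // Variable length: 3 and 4 bytes in UTF-8, 2 and 4 bytes in UTF-16BE.
        System.out.println(bmp.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

So the encoding chosen for text keys changes the scan order a user sees, which is one argument for either fixing the encoding or making it an explicit, recorded choice.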

 On Oct 29, 2012 8:47 PM, "Josh Elser" <> wrote:

> +1 Mike.
> 1. It would be hard for me to believe Key/Value are ever handled
> internally in terms of Strings, but, if such a case does exist, it would be
> extremely prudent to fix.
> 2. FWIW, the Shell does use ISO-8859-1 as its charset which is referenced
> by other commands [1,2]. It would be good to double check all of the other
> commands.
> [1] accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/
> [2] accumulo/blob/trunk/core/src/main/java/org/apache/accumulo/core/util/shell/commands/
> On 10/29/2012 8:27 PM, Michael Flester wrote:
>> I agree with Benson entirely with one caveat. It seems to me that there
>> might be two categories of things being discussed
>>    1. User data (keys and values)
>>    2. Ancillary things needed for operation of Accumulo (passwords).
>> These could well be considered separately. Trying to do anything with
>> keys and values other than treating them as bytes all of the time
>> I find quite scary.
>> And if this is only being done to satisfy PMD or FindBugs, those tools
>> can be convinced to modify their reporting about this issue.
