accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-836) Specify Charset on getBytes() call for String objects.
Date Fri, 02 Nov 2012 03:32:12 GMT


Josh Elser commented on ACCUMULO-836:

*GrepIterator*: The javadoc should note that the String being converted to bytes will
be treated as UTF-8 encoded bytes; otherwise the code should not make the UTF-8 assertion at all.

*MetadataTable#encode(), DistributedReadWriteLock#getLockData()*: Should note that the byte[]
returned from the specified method is UTF-8 bytes.

*LongCombiner.StringEncoder, StringMax, StringMin, StringSummation, SummingArrayCombiner.StringArrayEncoder,
Authorizations, Master#mergeMetadataRecords*: These classes are creating bytes that are UTF-8,
but when the bytes are initially read into a String (from a Value typically), the default
encoding is used (String constructor that takes a byte array). This leads to inconsistency
as the data could have been read as something other than UTF-8 but then written back out as
UTF-8. A decision needs to be made about what to do, and that decision needs to be documented.
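
A minimal sketch of the inconsistency described above, not taken from the Accumulo code: the class and method names are hypothetical, and ISO-8859-1 is only an assumed stand-in for a platform default that is not UTF-8. Bytes that were stored as UTF-8 are decoded with the "default" charset and then written back out as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {
    // Hypothetical helper: decode stored bytes with readCharset (standing in for
    // the platform default used by new String(byte[])), then re-encode as UTF-8.
    static byte[] readThenWrite(byte[] stored, Charset readCharset) {
        String decoded = new String(stored, readCharset);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
        byte[] stored = "\u00e9".getBytes(StandardCharsets.UTF_8);

        byte[] sameCharset = readThenWrite(stored, StandardCharsets.UTF_8);
        byte[] otherCharset = readThenWrite(stored, StandardCharsets.ISO_8859_1);

        // Reading and writing with the same charset preserves the bytes.
        System.out.println("UTF-8 read preserved bytes: " + Arrays.equals(stored, sameCharset));
        // Reading with a different charset changes what gets written back.
        System.out.println("ISO-8859-1 read preserved bytes: " + Arrays.equals(stored, otherCharset));
        System.out.println("lengths: " + stored.length + " -> " + otherCharset.length);
    }
}
```

When the read side uses a different charset than the write side, the data is silently rewritten with different bytes, which is why the read and write encodings need to be decided together.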

*ZooStore*: Some awkwardness pops out at me in #setProperty(long, String, Serializable), which manually
adds bytes to the data to be written to ZooKeeper. I don't think UTF-8 will cause any problems,
but it could definitely use some clarification.

*TraceServer.Receiver, IndexMeta, AddFilesWithMissingEntries, MetadataTable*: These write out a
Value as UTF-8 bytes, but I'm not sure whether there is any case in which a client reading
that data would expect something else. Documentation again would be useful. The likelihood
of this being an issue is probably small considering that Hadoop's WritableUtils encodes Strings
as UTF-8.

I'm still a little concerned about access points to ZooKeeper and !METADATA, but given that
ZooReaderWriter was converting the username and password to UTF-8 bytes, I feel slightly better.
I should dig into that code more tomorrow.

One final statement: I still believe that in the ambiguous cases where core classes read arbitrary
bytes and write UTF-8 bytes, Accumulo should be agnostic and not make encoding assertions.
In other words, I think we should revert those changes and leave it up to the user to decide
how they handle their bytes.

> Specify Charset on getBytes() call for String objects.
> ------------------------------------------------------
>                 Key: ACCUMULO-836
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.5.0
>            Reporter: David Medinets
>            Assignee: David Medinets
>            Priority: Minor
>             Fix For: 1.5.0
>         Attachments:
> The comments on ACCUMULO-241 indicate that the build server might have a different default
> Charset than computers used by developers. Therefore, some of the tests have different results
> on different computers.
> Every getBytes call on a String object should specify the UTF8 Charset. Unfortunately
> the codebase has nearly 1,800 getBytes calls.
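
To illustrate the change the issue asks for, here is a small sketch (the class and method names are hypothetical, not from the Accumulo codebase). The no-argument `getBytes()` uses the JVM's default charset, so its result can differ between machines, while passing a charset pins the encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Hypothetical helper: always encode with UTF-8, independent of the JVM default.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // ends with U+00E9, which is not ASCII

        // Depends on the machine: the default charset decides the byte count.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes() length: " + s.getBytes().length);

        // Same on every machine: three ASCII bytes plus two bytes for U+00E9.
        System.out.println("getBytes(UTF_8) length: " + encode(s).length);
    }
}
```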

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
