commons-dev mailing list archives

From sebb <seb...@gmail.com>
Subject [CODEC] how to handle invalid encode/decode input?
Date Sun, 26 Mar 2017 23:13:46 GMT
Various Codec methods need to encode and decode bytes/Strings.

Not all byte sequences can be decoded into Strings, and not all
Strings can be encoded into bytes.

So a decision has to be made as to what to do when an invalid sequence
is detected.

At present the encoding/decoding is done by the String class.
The Javadoc for the methods that take a Charset says:

"This method always replaces malformed-input and unmappable-character
sequences with this charset's default replacement" (byte array or
String depending on direction)

However, the Javadoc for the methods that take the charset name as a String says:

"The behavior of this method when this string cannot be encoded in the
given charset is unspecified"

It looks as though the "unspecified" behaviour is to replace invalid
sequences, but this cannot be guaranteed across all JVMs.

That can easily be fixed by ensuring that the code only ever uses the
methods that take a Charset.
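As a rough illustration of the documented replacement behaviour, here is a minimal sketch that uses only the Charset overloads. The class and method names (CharsetOverloads, decodeUtf8, encodeAscii) are made up for this example and are not part of Codec.

```java
import java.nio.charset.StandardCharsets;

final class CharsetOverloads {
    // Decode via the Charset overload: malformed input is documented to be
    // replaced with the charset's default replacement string.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode via the Charset overload: unmappable characters are documented
    // to be replaced with the charset's default replacement byte array.
    static byte[] encodeAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // 0xC3 announces a two-byte UTF-8 sequence, but 0x28 is not a
        // continuation byte, so the input is malformed.
        byte[] malformed = {(byte) 0xC3, (byte) 0x28};
        System.out.println(decodeUtf8(malformed).contains("\uFFFD")); // true

        // U+00E9 has no US-ASCII mapping, so it is replaced by '?'.
        System.out.println(encodeAscii("\u00E9")[0] == (byte) '?');   // true
    }
}
```

Neither call reports a problem; the caller only sees the substituted output.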

However it's not obvious that replacement is the correct policy.

See for example:

CODEC-228 URLCodec.decode does not throw DecoderException with invalid UTF-8

It seems to me it would be better to report errors.
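One possible way to report errors, sketched here only as an assumption about how it could be done, is to go through CharsetDecoder/CharsetEncoder with CodingErrorAction.REPORT instead of the String methods. The names StrictCoding, decodeStrict and encodeStrict are hypothetical, and IllegalArgumentException merely stands in for whatever Codec would actually throw (presumably DecoderException/EncoderException).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictCoding {
    // Decode bytes, failing on malformed or unmappable input instead of
    // substituting a replacement character.
    static String decodeStrict(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Placeholder exception type; an assumption for this sketch.
            throw new IllegalArgumentException("Invalid " + charset.name() + " input", e);
        }
    }

    // Encode a String, failing on malformed or unmappable characters.
    static byte[] encodeStrict(String s, Charset charset) {
        try {
            ByteBuffer out = charset.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            byte[] result = new byte[out.remaining()];
            out.get(result);
            return result;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Cannot encode as " + charset.name(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeStrict("abc".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
        try {
            decodeStrict(new byte[] {(byte) 0xC3, (byte) 0x28}, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```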

At present, a round-trip encode/decode sequence may not reproduce
the original input.
That seems wrong for Codec, which IMO should be able to accurately
encode and decode its input.
At present conversions may be silently 'adjusted'.
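To make the round-trip point concrete, here is a small sketch using the plain String methods rather than any Codec class. RoundTripCheck and its methods are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripCheck {
    // Decode and then re-encode using the replacing String methods.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // True only if the bytes come back unchanged.
    static boolean survivesRoundTrip(byte[] input) {
        return Arrays.equals(input, roundTrip(input));
    }

    public static void main(String[] args) {
        byte[] valid = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] malformed = {(byte) 0xC3, (byte) 0x28};

        System.out.println(survivesRoundTrip(valid));     // true
        System.out.println(survivesRoundTrip(malformed)); // false
        // The malformed byte 0xC3 comes back as the three-byte encoding of U+FFFD.
        System.out.println(Arrays.toString(roundTrip(malformed)));
    }
}
```

The malformed input goes in as two bytes and comes back as four, with no indication that anything was changed.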
