ignite-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vladimir Ozerov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-5655) Introduce pluggable string encoder/decoder
Date Wed, 23 Aug 2017 12:16:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138271#comment-16138271
] 

Vladimir Ozerov commented on IGNITE-5655:
-----------------------------------------

[~andrey-kuznetsov], my comments:
1) {{BinaryConfiguration.encoding}} - passing it as a {{byte}} is bad idea as it limits total
number of available encodings. We should pass it as {{String}}, in the same way as it is done
in Java classes. In case encoding is not supported, an exception should be thrown.
2) {{OdbcColumnMeta, OdbcMessageParser}} - unnecessary changes; ODBC doesn't distinguish between
String types, this is our internal implementation detail.
3) {{SqlListenerNioListener}} - another unnecessary change. We encode strings to save space
in storage and page memory. Encoding strings when sending error message doesn't make sense.
4) {{SqlListenerUtils}} - unnecessary changes, as encoded strings are not supported in JDBC/ODBC
at the moment (I do not see any changes in drivers). Hence, we should always talk to drivers
using {{UTF-8}} strings.
5) {{BinaryStringEncoding}} - should be a part of public API, so let's move it to {{org.apache.ignite.binary}}
package.
6) {{BinaryReaderExImpl.fieldFlagNames}} - looks like an overkill to me. Just concatenate
two well-known flags by hand.
7) {{BinarySerializedFieldComparator}} - it is illegal to ignore encoding byte. You may endup
with situation where two different strings are encoded with different encoding, but have similar
binary representation. Your code will report {{true}}, while {{false}} should be returned.
Correct flow here:
7.1) If objects have different encodings - deserialize them and compare using String.compareTo
7.2) If encodings are same, we need to understand whether encoded representation is comparable.
For example, suppose that in certain encoding {{A}} represented as {{2}}, and {{B}} is represented
as {{1}}. Then {{A.compareTo(B) == -1}}, but {{compareArrays(serialize(A), serialize(B) ==
1}}, which is wrong. E.g. see \[1\]. We need to think on how to expose this properly, but
I think this is a different task. For now let's just restrict any binary comparison of non-UTF8
strings and always deserialize them.
8) Feature is definitely undertested. At the very least we need the following tests:
8.1) Run SQL queries with custom encoding
8.2) Start two nodes with different encoding and see how they interact with each other.

Not covered cases which I propose to implement separately: ODBC, JDBC, C++, .NET, efficient
ocmparison for inlined indexes and BinarySerializedFieldComparator.

The rest looks good to me. Thanks for contribution!

\[1\] https://dev.mysql.com/doc/refman/5.7/en/charset-general.html

> Introduce pluggable string encoder/decoder
> ------------------------------------------
>
>                 Key: IGNITE-5655
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5655
>             Project: Ignite
>          Issue Type: New Feature
>          Components: binary
>    Affects Versions: 2.0
>            Reporter: Valentin Kulichenko
>            Assignee: Andrey Kuznetsov
>             Fix For: 2.2
>
>
> Currently binary marshaller encodes strings in UTF-8. However, sometimes it makes sense
to serialize strings with different encodings to save space. Let's add global property to
control String encoding and customize our binary protocol to support it. For instance, we
can add another flag {{ENCODED_STRING}}, which will write strings as follows:
> [flag][encoding_flag][str_len][str_bytes]
> First implementation should set preferred encoding for strings in BinaryConfiguration.
This setting is optional, default encoding is UTF-8. Currently, the same BinaryConfiguration
is used for all cluster nodes, thus no encoding clashes are possible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message