ignite-dev mailing list archives

From Vladimir Ozerov <voze...@gridgain.com>
Subject Re: IGNITE-5655: Mixing binary string encodings in Ignite cluster
Date Mon, 11 Sep 2017 08:01:42 GMT
Dima,

You contradict yourself - you vote for per-column encoding on the one hand,
but say that it is "over-architected" on the other. This is exactly what I
am talking about - anything more than a hard-coded cluster-wide encoding is
complex. You cannot simply define a per-column encoding. In addition, you
should either pass information about this encoding to all cluster members
and to all clients, so that they construct the correct binary object in the
first place, or re-convert the binary object on the fly, which is what I
suggested. There is no simple solution here.
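
To make "re-convert on the fly" concrete, here is a minimal sketch of
transcoding string bytes between two charsets using only the JDK. It is not
Ignite code: the class and method names (TranscodeSketch, transcode) are
hypothetical, and it assumes the JDK in use provides the windows-1251
charset. The strict coding actions make a lossy conversion fail instead of
silently replacing characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Ignite.
final class TranscodeSketch {
    /** Re-encodes bytes from one charset to another; throws if data would be lost. */
    static byte[] transcode(byte[] src, Charset from, Charset to)
        throws CharacterCodingException {
        if (from.equals(to))
            return src; // Nothing to convert.

        // Decode strictly: malformed input is reported, not replaced.
        CharBuffer chars = from.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(src));

        // Encode strictly: characters missing in the target charset are reported.
        ByteBuffer out = to.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(chars);

        byte[] res = new byte[out.remaining()];
        out.get(res);
        return res;
    }

    public static void main(String[] args) throws Exception {
        Charset cp1251 = Charset.forName("windows-1251"); // Assumed to be available.

        // Cyrillic text fits into Cp1251, so this conversion succeeds.
        byte[] utf8 = "\u041f\u0440\u0438\u0432\u0435\u0442".getBytes(StandardCharsets.UTF_8);
        byte[] converted = transcode(utf8, StandardCharsets.UTF_8, cp1251);
        System.out.println(utf8.length + " bytes -> " + converted.length + " bytes");

        // CJK text does not fit into Cp1251, so the conversion is rejected.
        try {
            transcode("\u65e5\u672c".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.UTF_8, cp1251);
        }
        catch (CharacterCodingException e) {
            System.out.println("Lossy conversion rejected: " + e);
        }
    }
}
```

Even in this toy form, every mismatch costs a full decode and encode pass,
and someone has to decide what to do when the target charset cannot
represent the data.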

I vote for cluster-wide encoding for now, but with transparent conversion
when needed.
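
As a rough illustration of conversion driven by an encoding tag (the header
flag/ID idea from my earlier message quoted below), consider the toy layout
in this sketch. Everything in it is an assumption made for illustration: the
one-byte IDs, the [id][bytes] layout and the class name EncodingTagSketch
are hypothetical and unrelated to the actual Ignite binary format. Note that
String.getBytes(Charset) silently replaces unmappable characters, so this
sketch does not detect data loss.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical toy format, not the Ignite binary object layout.
final class EncodingTagSketch {
    // Hypothetical encoding IDs.
    static final byte UTF8 = 0;
    static final byte CP1251 = 1;

    static Charset charsetOf(byte id) {
        switch (id) {
            case UTF8: return StandardCharsets.UTF_8;
            case CP1251: return Charset.forName("windows-1251");
            default: throw new IllegalArgumentException("Unknown encoding id: " + id);
        }
    }

    /** Writes a string field as [encoding id][encoded bytes]. */
    static byte[] write(String s, byte id) {
        byte[] payload = s.getBytes(charsetOf(id));
        byte[] res = new byte[payload.length + 1];
        res[0] = id;
        System.arraycopy(payload, 0, res, 1, payload.length);
        return res;
    }

    /** Reads the string back using the charset named by the tag. */
    static String read(byte[] field) {
        return new String(field, 1, field.length - 1, charsetOf(field[0]));
    }

    /** Returns the field untouched if the tag matches, otherwise rebuilds it. */
    static byte[] ensureEncoding(byte[] field, byte expected) {
        return field[0] == expected ? field : write(read(field), expected);
    }

    public static void main(String[] args) {
        String s = "\u041f\u0440\u0438\u0432\u0435\u0442";
        byte[] a = write(s, UTF8);
        byte[] b = write(s, CP1251);

        // Same logical value, different bytes: byte-wise comparison fails.
        System.out.println(Arrays.equals(a, b));

        // After rebuilding to the expected encoding the bytes match again.
        System.out.println(Arrays.equals(ensureEncoding(a, CP1251), b));
    }
}
```

The first comparison shows why byte-wise equality breaks once encodings are
mixed; the second shows that rebuilding to the expected encoding restores
it, at the cost of a conversion.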


On Thu, Sep 7, 2017 at 4:50 AM, Dmitriy Setrakyan <dsetrakyan@apache.org>
wrote:

> I would agree with Andrey, it does look a bit over-architected to me. Why
> would anyone try to move data from one encoding to another? Is it a real
> use case that needs to be handled automatically?
>
> Here is what I think we should handle:
>
>    1. Ability to set cluster-wide encoding. This should be easy.
>    2. Ability to set per-column encoding. Such encoding should be set on
>    per-column level, perhaps at cache creation or table creation. For
>    example, at the cache creation time, we could let user define all
>    column names that will have non-default encodings.
>
> Thoughts?
>
> D.
>
> On Wed, Sep 6, 2017 at 6:27 AM, Andrey Kuznetsov <stkuzma@gmail.com>
> wrote:
>
> > As for option #1, it's not so bad. Currently we've implemented a
> > global-level encoding switch, and this looks similar to a DBMS: if the
> > server works with a certain encoding, then all clients should be
> > configured to use the same encoding for correct string processing.
> >
> > Option #2 provokes a number of questions.
> >
> > What are performance implications of such hidden binary reencoding?
> >
> > Who will check for possible data loss on transparent reencoding (when
> > object walks between caches/fields with distinct encodings)?
> >
> > How should we handle nested binary objects? On the one hand, they
> > should be reencoded in the way described by Vladimir. On the other hand,
> > BinaryObject is an independent entity that can be
> > serialized/deserialized freely, moved between various data structures,
> > etc. It will be frustrating for a user to find its binary state changed
> > after storing it in a grid, with possible data corruption.
> >
> >
> > As far as I can see, we are trying to couple orthogonal APIs:
> > BinaryMarshaller, IgniteCache and SQL. BinaryMarshaller is
> > Java-datatype-driven: it creates a 1-to-1 mapping between Java types and
> > their binary representations, and now we are trying to map two binary
> > types (STRING and ENCODED_STRING) to a single String class. IgniteCache
> > is a much more flexible API than SQL, but it lacks an encoded string
> > datatype, which exists in the SQL dialects of some RDBMSs:
> > `varchar(n) character set some_charset`.
> > It's not a popular idea, but many problems could be solved by adding
> > such a type. Those IgniteCache API users who don't need it won't use it,
> > but it could become a bridge between SQL and BinaryMarshaller
> > encoded-string types.
> >
> > 2017-09-06 10:32 GMT+03:00 Vladimir Ozerov <vozerov@gridgain.com>:
> >
> > > What we tried to achieve is that several encodings could co-exist in a
> > > single cluster or even a single cache. This would be great from a UX
> > > perspective. However, from what Andrey wrote, I understand that this
> > > would be pretty hard to achieve, as we rely heavily on similar binary
> > > representation of objects being compared. That said, while this could
> > > work for SQL with some adjustments, we will have severe problems with
> > > BinaryObject.equals().
> > >
> > > Let's think about how we can resolve this. I see two options:
> > > 1) Allow only a single encoding in the whole cluster. Easy to
> > > implement, but very bad from a usability perspective. This would
> > > especially affect clients: client nodes and, what is worse, drivers and
> > > thin clients! They would all have to bother about which encoding to
> > > use. But maybe we can share this information during the handshake (as
> > > every client has a handshake).
> > >
> > > 2) Add a custom encoding flag/ID to the object header if a non-standard
> > > encoding appears somewhere inside the object (even in nested objects).
> > > This way, we will be able to re-create the object if needed when the
> > > expected and actual encodings don't match. For example, consider two
> > > caches/tables with different encodings (not implemented in the current
> > > iteration, but we may decide to implement per-cache encodings in the
> > > future, as any RDBMS supports it). Then I decide to move object A from
> > > cache 1 with UTF-8 encoding to cache 2 with Cp1251 encoding. In this
> > > case I will detect the encoding mismatch through the object header (or
> > > footer) and rebuild it transparently for the user.
> > >
> > > The second option is preferable to me as a long-term solution, but
> > > would require more effort.
> > >
> > > Thoughts?
> > >
> > > --
> > Best regards,
> >   Andrey Kuznetsov.
> >
>
