ignite-dev mailing list archives

From Alexey Goncharuk <alexey.goncha...@gmail.com>
Subject Re: Default hash code generation strategy for new binary objects
Date Thu, 29 Sep 2016 16:14:00 GMT
Assuming that we do not have 'non-equals' fields, binary array comparison
should generally be faster: even if we did a field-by-field comparison, we
would read the same amount of data anyway, but would also need to do some
byte juggling for type conversion.
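
A minimal sketch of the two comparison styles, assuming a hypothetical key
with one int and one long field laid out back to back. The class and method
names are illustrative only and are not part of the Ignite API; real binary
objects also carry a header that is ignored here.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class KeyComparisonSketch {
    /** Serializes a hypothetical two-field key (int a, long b) into a flat byte array. */
    static byte[] serialize(int a, long b) {
        return ByteBuffer.allocate(Integer.BYTES + Long.BYTES).putInt(a).putLong(b).array();
    }

    /** Equality on the serialized form: one pass over the bytes, no decoding. */
    static boolean equalsByBytes(byte[] x, byte[] y) {
        return Arrays.equals(x, y);
    }

    /** Field-by-field equality: reads the same bytes, but decodes each field first. */
    static boolean equalsByFields(byte[] x, byte[] y) {
        ByteBuffer bx = ByteBuffer.wrap(x);
        ByteBuffer by = ByteBuffer.wrap(y);
        return bx.getInt() == by.getInt() && bx.getLong() == by.getLong();
    }

    public static void main(String[] args) {
        byte[] k1 = serialize(1, 2L);
        byte[] k2 = serialize(1, 2L);
        byte[] k3 = serialize(1, 3L);
        System.out.println(equalsByBytes(k1, k2));  // true
        System.out.println(equalsByFields(k1, k2)); // true
        System.out.println(equalsByBytes(k1, k3));  // false
        System.out.println(equalsByFields(k1, k3)); // false
    }
}
```

Both checks agree only as long as every serialized field takes part in
equality, which is the assumption stated above.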

2016-09-29 19:07 GMT+03:00 Dmitriy Setrakyan <dsetrakyan@apache.org>:

> Alexey, in your opinion, what will be faster, the binary array comparison
> or field-by-field comparison?
>
> On Thu, Sep 29, 2016 at 8:39 AM, Alexey Goncharuk <
> alexey.goncharuk@gmail.com> wrote:
>
> > Folks, let me point out a few obvious (or not) things
> >
> > The set of fields participating in hashCode and equals is impossible to
> > change without a cluster restart. Imagine a new client adding a field F
> > to key structure (A, B) so that the new key is (A, B, F). In this case
> > key (1, 2, 0) will be treated as a different key with respect to
> > (1, 2, 3), while the new client expects them to be the same. I think the
> > set of fields defining the key _must_ be used in both hash code and
> > equals calculation and cannot change over time. Having said that, we can
> > use binary array comparison as the object equality test. The presence of
> > new 'garbage' fields in a key looks useless to me.
> >
> > So, we should support:
> > 1) All the things Dmitriy pointed out, in other words - multi-field key
> > with affinity key fields
> > 2) Ability to instantiate the key using a text-only environment
> > 3) Ability to define the structure of a key using XML or DDL
> >
> > I think all of these can be done via combination of HashCodeResolver
> > interface + default resolver which can read type configuration.
> >
> > 2016-09-29 3:54 GMT+03:00 Dmitriy Setrakyan <dsetrakyan@apache.org>:
> >
> > > Guys,
> > >
> > > We need to look at 3 cases:
> > >
> > > a) key is just one field
> > > b) key is multiple fields
> > > c) key is one or multiple fields, with possibility of an alternate
> > > affinity key
> > >
> > > For (a) and (b), whenever a type is defined in XML, and further in DML,
> > > a user will specify which fields are part of the key. In that case, we
> > > should just grab those fields and calculate the hashcode automatically.
> > >
> > > For (c), a user should specify in XML, and further in DML, which fields
> > > should be used for affinity, in addition to the key fields. In that
> > > case, we again should grab those fields and calculate the hashcodes for
> > > the primary key and the affinity key.
> > >
> > > I really am not sure if there are other ways of doing it. Am I missing
> > > something?
> > >
> > > D.
> > >
> > >
> > > On Wed, Sep 28, 2016 at 5:14 PM, Denis Magda <dmagda@gridgain.com> wrote:
> > >
> > > > Let me show the picture I have in my mind:
> > > >
> > > > - Primary key is a must for all INSERT and MERGE operations. If it’s
> > > >   not set then an INSERT/MERGE fails.
> > > > - If a primary key is a boxed/unboxed primitive (int, Integer, String,
> > > >   Date, etc.) then the key value is used for hashCode calculation. At
> > > >   the same time the key will be an affinity key.
> > > > - If a primary key is a custom object then its value can be passed as
> > > >   a param directly from Java, .Net, C++ in a way like “INSERT (_key,
> > > >   field1, field2) VALUES (?, val1, val2)”. In this scenario we will
> > > >   call hashCode directly on the key's value. In addition, we will be
> > > >   able to get an affinity key since we know the key’s type class
> > > >   descriptor (BinaryClassDescriptor).
> > > > - If a primary key is still a custom key and we want to insert its
> > > >   value from an SQL console, PHP, Tableau, etc. then we can’t pass the
> > > >   key’s value as is. Here we’re trying to apply a workaround by
> > > >   listing the key's fields in INSERT/MERGE, and the task is to
> > > >   properly reconstruct the key on our side using only specific fields.
> > > >
> > > > Is my understanding correct for all the bullets above?
> > > >
> > > > If so then, yes, I would agree that we need to list these fields in a
> > > > configuration and the default hash code resolver will use them as
> > > > well. Moreover, we have to pinpoint an affinity field. So, the
> > > > question is what configuration we should use.
> > > >
> > > > Community, any other thoughts/ideas?
> > > >
> > > > —
> > > > Denis
> > > >
> > > > > On Sep 28, 2016, at 4:16 PM, Alexander Paschenko <alexander.a.paschenko@gmail.com> wrote:
> > > > >
> > > > > Also MERGE.
> > > > >
> > > > > 2016-09-29 2:10 GMT+03:00 Denis Magda <dmagda@gridgain.com>:
> > > > >> You need a hash code only for INSERT operation, right?
> > > > >>
> > > > >> —
> > > > >> Denis
> > > > >>
> > > > > >>> On Sep 28, 2016, at 3:47 PM, Alexander Paschenko <alexander.a.paschenko@gmail.com> wrote:
> > > > > >>>
> > > > > >>> But what if the user works from some kind of console and just
> > > > > >>> types the queries as text in full and does not bind params via
> > > > > >>> JDBC or something alike? What if there's no binary object? I
> > > > > >>> don't see why we should keep the user from usual cache gets in
> > > > > >>> this case. I really like the idea of supplying the values of
> > > > > >>> distinct fields, thus freeing the user of the need to mess with
> > > > > >>> objects and builders, AND then just calculating the hash code as
> > > > > >>> suggested before - say, via explicitly listing participating
> > > > > >>> fields in XML or by marking them with the transient keyword or
> > > > > >>> some annotation.
> > > > > >>> Actually, I believe that's the only case when we need to
> > > > > >>> generate any hash codes - when the class is present, we can just
> > > > > >>> get the hash code from its implementation of the method. When
> > > > > >>> there's no class, we generate. And all that is solely for SQL.
> > > > > >>> For the rest - just throw an exception when there's no hash code
> > > > > >>> manually set for the binary object. I don't see why we should
> > > > > >>> try to generate anything when the user already is using Ignite
> > > > > >>> in full, not just via the limited interface of SQL.
> > > > >>>
> > > > >>> 2016-09-29 0:31 GMT+03:00 Denis Magda <dmagda@gridgain.com>:
> > > > >>>> Hmm, this is a good question.
> > > > >>>>
> > > > > >>>> If a user doesn’t provide a _key when an INSERT is executed,
> > > > > >>>> for me it means that he is not going to use the key later for
> > > > > >>>> cache.get/put, DELETE, UPDATE and other possible operations,
> > > > > >>>> simply because he doesn’t know how to reconstruct the key back
> > > > > >>>> in his code. If he wants to use the primary key in the rest of
> > > > > >>>> the operations then he must provide it at INSERT time.
> > > > > >>>>
> > > > > >>>> Do we need this key only for a case when an object is being
> > > > > >>>> inserted into a cache? If so I would auto-generate a key using
> > > > > >>>> ‘long’ as a key type. I do remember that we provided the
> > > > > >>>> auto-generation for the Spark module in some way that may be
> > > > > >>>> useful here.
> > > > >>>>
> > > > >>>> —
> > > > >>>> Denis
> > > > >>>>
> > > > > >>>>> On Sep 28, 2016, at 9:53 AM, Alexander Paschenko <alexander.a.paschenko@gmail.com> wrote:
> > > > > >>>>>
> > > > > >>>>> Denis,
> > > > > >>>>>
> > > > > >>>>> That's not what I was asking about.
> > > > > >>>>> Currently the DML implementation allows for dynamic
> > > > > >>>>> instantiation of keys, in other words, the user does not have
> > > > > >>>>> to provide a value for the object-typed _key column - instead,
> > > > > >>>>> he may supply just field values based on which _key will be
> > > > > >>>>> dynamically instantiated/binary built. And that's the whole
> > > > > >>>>> point of this discussion as I see it: what to do when we've
> > > > > >>>>> binary built a classless key that we build ourselves inside
> > > > > >>>>> the SQL engine and don't know how to compute a hash code for
> > > > > >>>>> it?
> > > > >>>>>
> > > > >>>>> - Alex
> > > > >>>>>
> > > > >>>>> 2016-09-28 19:48 GMT+03:00 Denis Magda <dmagda@gridgain.com>:
> > > > >>>>>> Alexander,
> > > > >>>>>>
> > > > > >>>>>> I guess if we have a key without a class then it will be
> > > > > >>>>>> constructed using a BinaryBuilder instance and it’s the
> > > > > >>>>>> user's responsibility to set the hash code at the end with
> > > > > >>>>>> the BinaryBuilder.hashCode method. Sure, all these cases must
> > > > > >>>>>> be well-documented in both the Javadoc API and the Apache
> > > > > >>>>>> Ignite documentation.
> > > > >>>>>>
> > > > >>>>>> —
> > > > >>>>>> Denis
> > > > >>>>>>
> > > > > >>>>>>> On Sep 28, 2016, at 9:33 AM, Alexander Paschenko <alexander.a.paschenko@gmail.com> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>> Dmitry, Denis,
> > > > > >>>>>>>
> > > > > >>>>>>> OK, but I think it's necessary to also address the cases
> > > > > >>>>>>> when there's no actual class for the key, and its fields are
> > > > > >>>>>>> simply declared in XML. In this case, there are no fields to
> > > > > >>>>>>> be marked transient. What do we do then? List transient
> > > > > >>>>>>> fields in XML separately?
> > > > >>>>>>>
> > > > >>>>>>> - Alex
> > > > >>>>>>>
> > > > >>>>>>> 2016-09-28 4:15 GMT+03:00 Dmitriy Setrakyan
<
> > > dsetrakyan@apache.org
> > > > >:
> > > > >>>>>>>> Agree with Denis.
> > > > >>>>>>>>
> > > > >>>>>>>> - by default, all non-transient key fields
should
> participate
> > in
> > > > the
> > > > >>>>>>>> hashcode generation
> > > > >>>>>>>> - when working on DDL, then the primary
key fields should
> > > > participate in
> > > > >>>>>>>> the hashcode
> > > > >>>>>>>> - we should add a resolver to override
the default behavior
> > > > (please
> > > > >>>>>>>> propose the interface in Jira)
> > > > >>>>>>>> - we should print out a warning, only
once per type, the the
> > > > hashcode
> > > > >>>>>>>> has been automatically generated based
on which fields and
> > which
> > > > formula
> > > > >>>>>>>>
> > > > >>>>>>>> D.
> > > > >>>>>>>>
> > > > > >>>>>>>> On Tue, Sep 27, 2016 at 5:42 PM, Denis Magda <dmagda@gridgain.com> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Hi Alexander,
> > > > >>>>>>>>>
> > > > > >>>>>>>>> Vladimir’s proposal sounds reasonable to me. However we
> > > > > >>>>>>>>> must keep in mind one important thing. Binary objects were
> > > > > >>>>>>>>> designed to address the following disadvantages a regular
> > > > > >>>>>>>>> serializer, like the optimized marshaller, has:
> > > > > >>>>>>>>> 1) necessity to deserialize an object on the server side
> > > > > >>>>>>>>> every time it’s needed.
> > > > > >>>>>>>>> 2) necessity to hold an object in both serialized and
> > > > > >>>>>>>>> deserialized forms on the server node.
> > > > > >>>>>>>>> 3) necessity to restart the whole cluster each time an
> > > > > >>>>>>>>> object version is changed (a new field is added or an old
> > > > > >>>>>>>>> one is removed).
> > > > > >>>>>>>>> If it will be needed to perform step 3 for a default
> > > > > >>>>>>>>> implementation of the binary resolver just because the
> > > > > >>>>>>>>> resolver has to consider new fields or ignore old ones
> > > > > >>>>>>>>> then such an implementation sucks. Overall, the default
> > > > > >>>>>>>>> implementation should use reflection, going over all the
> > > > > >>>>>>>>> fields a key has and ignoring the ones that are marked
> > > > > >>>>>>>>> with the “transient” keyword. If a user wants to control
> > > > > >>>>>>>>> the default resolver's logic then he can label all the
> > > > > >>>>>>>>> fields that mustn’t be part of the final hash code value
> > > > > >>>>>>>>> with “transient”. This has to be well-documented for sure.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Makes sense?
> > > > >>>>>>>>>
> > > > >>>>>>>>> —
> > > > >>>>>>>>> Denis
> > > > >>>>>>>>>
> > > > > >>>>>>>>>> On Sep 26, 2016, at 12:40 PM, Alexander Paschenko <alexander.a.paschenko@gmail.com> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Hello Igniters,
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>> As DML support is near, it's critical that we agree on
> > > > > >>>>>>>>>> how we generate hash codes for new keys in the presence
> > > > > >>>>>>>>>> of the binary marshaller. Actually, this discussion isn't
> > > > > >>>>>>>>>> new - please see its beginning here:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> http://apache-ignite-developers.2346864.n4.nabble.com/All-BinaryObjects-created-by-BinaryObjectBuilder-stored-at-the-same-partition-by-default-td8042.html
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Still, I'm creating this new thread to make getting to
> > > > > >>>>>>>>>> the final solution as simple and fast as possible.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I remind everyone that the approach that has got the
> > > > > >>>>>>>>>> least criticism was the one proposed by Vladimir Ozerov:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> <quote>
> > > > > >>>>>>>>>> I think we can do the following:
> > > > > >>>>>>>>>> 1) Add "has hash code" flag as Denis suggested.
> > > > > >>>>>>>>>> 2) If object without a hash code is put to cache, throw
> > > > > >>>>>>>>>> an exception.
> > > > > >>>>>>>>>> 3) Add *BinaryEqualsHashCodeResolver* interface.
> > > > > >>>>>>>>>> 4) Add default implementation which will auto-generate
> > > > > >>>>>>>>>> hash code. *Print a warning when auto-generation occurs*,
> > > > > >>>>>>>>>> so that user is aware that he is likely to have problems
> > > > > >>>>>>>>>> with normal GETs/PUTs.
> > > > > >>>>>>>>>> 5) Add another implementation which will use encoded
> > > > > >>>>>>>>>> string to calculate a hash code. E.g.
> > > > > >>>>>>>>>> *new BinaryEqualsHashCodeResolver("{a} * 31 + {b}")*.
> > > > > >>>>>>>>>> Originally proposed by Yakov some time ago.
> > > > >>>>>>>>>> </quote>
> > > > >>>>>>>>>>
> > > > > >>>>>>>>>> After that, Sergi suggested that instead of a "formula"
> > > > > >>>>>>>>>> we keep just a list of the "fields" that participate in
> > > > > >>>>>>>>>> hash code evaluation, and with that list, we simply
> > > > > >>>>>>>>>> calculate the hash code just like an IDE does - with all
> > > > > >>>>>>>>>> its bit shifts and additions.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I'm planning on settling down with this combined
> > > > > >>>>>>>>>> Vlad-Sergi approach. Any objections?
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> And an extra question I have: Vlad, you suggest that we
> > > > > >>>>>>>>>> both throw an exception on hash code absence and that we
> > > > > >>>>>>>>>> might generate it as the last resort. Do I understand you
> > > > > >>>>>>>>>> correctly that you suggest generating a random code only
> > > > > >>>>>>>>>> in the context of SQL, but throwing an exception for keys
> > > > > >>>>>>>>>> without codes on an ordinary put?
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> And yes, built-in hash codes for JDK types are supported,
> > > > > >>>>>>>>>> as well as items 1-2 from Vlad's list (there's the
> > > > > >>>>>>>>>> already fixed issue IGNITE-3633 for the flag and its
> > > > > >>>>>>>>>> presence check).
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> - Alex
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>
> > > > >>>>
> > > > >>
> > > >
> > > >
> > >
> >
>
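
The field-list idea from the quoted thread can be sketched roughly as
follows. This is only an illustration under stated assumptions: keys are
modeled as plain maps rather than Ignite binary objects, the names
(FieldListHashSketch, hashOf, equalOn) are hypothetical, and the 31-based
formula is the usual IDE-style one, not something the thread has settled on.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class FieldListHashSketch {
    /** Builds a hypothetical key with fields A, B and F as a field-name to value map. */
    static Map<String, Object> key(int a, int b, int f) {
        Map<String, Object> k = new LinkedHashMap<>();
        k.put("A", a);
        k.put("B", b);
        k.put("F", f);
        return k;
    }

    /** IDE-style hash over the configured fields only: h = 31 * h + hash(field). */
    static int hashOf(Map<String, Object> key, List<String> fields) {
        int h = 1;
        for (String name : fields) {
            h = 31 * h + Objects.hashCode(key.get(name));
        }
        return h;
    }

    /** Equality restricted to the same configured field list. */
    static boolean equalOn(Map<String, Object> x, Map<String, Object> y, List<String> fields) {
        for (String name : fields) {
            if (!Objects.equals(x.get(name), y.get(name))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> ab = Arrays.asList("A", "B");
        List<String> abf = Arrays.asList("A", "B", "F");
        Map<String, Object> k1 = key(1, 2, 0);
        Map<String, Object> k2 = key(1, 2, 3);

        // With (A, B) as the key definition, the two keys are the same key.
        System.out.println(hashOf(k1, ab) == hashOf(k2, ab)); // true
        System.out.println(equalOn(k1, k2, ab));              // true
        // Once F takes part in equals, they become different keys.
        System.out.println(equalOn(k1, k2, abf));             // false
    }
}
```

The last line shows the (1, 2, 0) versus (1, 2, 3) situation from the
thread: changing the participating field list changes which keys are
considered equal.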
