From: Pavlukhin Ivan
Date: Sun, 2 Dec 2018 17:53:01 +0300
Subject: Re: Historical rebalance
To: dev@ignite.apache.org

Igor,

You wrote:
> different transactions may be applied in different order on backup nodes.
> That's why we need an active tx set

How does it work for the other cache modes, then? Can we receive
inconsistent data after a historical rebalance for TRANSACTIONAL and ATOMIC
caches? If inconsistency is possible, then we might be able to fix
historical rebalance for the other cache modes as well. The main point here
is that having only one common mechanism is much better - if it is
feasible, of course.

Fri, 30 Nov 2018 at 02:02, Seliverstov Igor:
>
> Vladimir,
>
> Look at my example:
>
> One active transaction (Tx1, which does the opX operations) while another
> tx (Tx2, which does the opX' operations) finishes with uc4:
>
> Node1:       ----uc1--op1----op2---uc2--op1'----uc3--uc4---op3------------X----
>
> Node2 (Tx1): ----uc1----op1----uc2----op2----uc3--op3----------uc4----cp1----
> Node2 (Tx2): ----uc1-------------uc2-------------uc3--------op1'----uc4----cp1----
>
> state on Node2: tx1 -> op3 -> uc2
>                 cp1 [current=uc4, backpointer=uc2]
>
> Here op2 was acknowledged by op3, and op3 was applied before op1'
> (linearized by WAL).
>
> All nodes having uc4 must have op1', because uc4 cannot be obtained
> earlier than the prepare stage, while the prepare stage happens after all
> updates, so *op1' happens before uc4* regardless of whether Tx2 was
> committed or rolled back.
>
> This means *op2 happens before uc4* (uc4 cannot be earlier than op2 on any
> node, because on Node2 op2 was already finished (acknowledged by op3) when
> op1' happened).
>
> That was my idea, which is easy to prove.
>
> You used a different approach, but yes, it has to work.
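For illustration, a minimal Java sketch of the bookkeeping the example above
leans on: the per-transaction "state" entries (tx -> counter observed when
its last completed operation started) and the checkpoint marks of the form
cp [current, backpointer]. All names are hypothetical, not Ignite internals;
this is a reading of the idea, not the implementation.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical sketch: tracks per-tx counters and produces checkpoint marks. */
    class TxCounterTracker {
        /** txId -> update counter observed when the tx's last *completed* operation started. */
        private final Map<Long, Long> activeTxs = new ConcurrentHashMap<>();

        private volatile long currentCounter;

        /** Called when the partition update counter moves forward. */
        void onCounterUpdate(long newCounter) {
            currentCounter = Math.max(currentCounter, newCounter);
        }

        /** Called when a tx operation finishes; cntrAtOpStart was read when the operation started. */
        void onTxOperationFinished(long txId, long cntrAtOpStart) {
            activeTxs.put(txId, cntrAtOpStart);
        }

        /** Called when the tx commits or rolls back. */
        void onTxFinished(long txId) {
            activeTxs.remove(txId);
        }

        /** Checkpoint mark, e.g. cp1 [current=uc4, backpointer=uc2] from the example. */
        CheckpointMark onCheckpoint() {
            long backpointer = activeTxs.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(currentCounter); // no active txs -> backpointer == current
            return new CheckpointMark(currentCounter, backpointer);
        }
    }

    /** The two counters saved with every checkpoint. */
    class CheckpointMark {
        final long current;
        final long backpointer;

        CheckpointMark(long current, long backpointer) {
            this.current = current;
            this.backpointer = backpointer;
        }
    }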
>
> Thu, 29 Nov 2018 at 22:19, Vladimir Ozerov:
>
> > "If more recent WAL records will contain *ALL* updates of the transaction"
> > -> "More recent WAL records will contain *ALL* updates of the transaction"
> >
> > On Thu, Nov 29, 2018 at 10:15 PM Vladimir Ozerov wrote:
> >
> > > Igor,
> > >
> > > Yes, I tried to draw different configurations, and it really seems to
> > > work, despite being very hard to prove due to non-intuitive HB edges.
> > > So let me try to spell out the algorithm once again to make sure that
> > > we are on the same page here.
> > >
> > > 1) There are two nodes - primary (P) and backup (B).
> > >
> > > 2) There are three types of events: small transactions which possibly
> > > increment the update counter (ucX), one long active transaction which
> > > is split into multiple operations (opX), and checkpoints (cpX).
> > >
> > > 3) Every node always has a current update counter. When a transaction
> > > commits, it may or may not shift this counter further, depending on
> > > whether there are holes behind it. But we have a strict rule that the
> > > counter always grows. Higher counters synchronize with smaller ones.
> > > Possible cases:
> > > ----uc1----uc2----uc3----
> > > ----uc1--------uc3------- // uc2 missing due to reorder, but it is ok
> > >
> > > 4) Operations within a single transaction are always applied
> > > sequentially, and hence also have an HB edge:
> > > ----op1----op2----op3----
> > >
> > > 5) When a transaction operation happens, we save in memory the current
> > > update counter available at that moment. I.e. we have a map from
> > > transaction ID to the update counter which was relevant by the time
> > > the last *completed* operation *started*. This is a very important
> > > point - we remember the counter when the operation starts, but update
> > > the map only when it finishes. This is needed for the situation when
> > > the update counter is bumped in the middle of a long operation.
> > > ----uc1----op1----op2----uc2----uc3----op3----
> > >             |      |                    |
> > >            uc1    uc1                  uc3
> > >
> > > state: tx1 -> op3 -> uc3
> > >
> > > 6) Whenever a checkpoint occurs, we save two counters with it:
> > > "current" and "backpointer". The latter is the smallest update counter
> > > associated with active transactions. If there are no active
> > > transactions, the current update counter is used.
> > >
> > > Example 1: no active transactions.
> > > ----uc1----cp1----
> > >
> > > state: cp1 [current=uc1, backpointer=uc1]
> > >
> > > Example 2: one active transaction:
> > > ----uc1----op1----uc2----op2----op3----uc3----cp1----
> > >
> > > state: tx1 -> op3 -> uc2
> > >        cp1 [current=uc3, backpointer=uc2]
> > >
> > > 7) Historical rebalance:
> > > 7.1) The demander finds the latest checkpoint, gets its backpointer
> > > and sends it to the supplier.
> > > 7.2) The supplier finds the earliest checkpoint where
> > > [supplier(current) <= demander(backpointer)].
> > > 7.3) The supplier reads that checkpoint's backpointer and finds the
> > > associated WAL record. This is where we start.
> > >
> > > So in terms of WAL we have: supplier[uc_backpointer <- cp(uc_current
> > > <= demander_uc_backpointer)] <- demander[uc_backpointer <- cp(last)]
> > >
> > > Now the most important part - why it works :-)
> > > 1) Transaction operations are sequential, so at the time of a crash
> > > the nodes are *at most one operation ahead* of each other.
> > > 2) The demander goes to the past and finds the update counter which
> > > was current at the time of the last completed TX operation.
> > > 3) The supplier goes to the closest checkpoint in the past where this
> > > update counter either doesn't exist or has just appeared.
> > > 4) The transaction cannot be committed on the supplier at this
> > > checkpoint, as that would violate the UC happens-before rule.
> > > 5) The transaction may have not started yet on the supplier at this
> > > point. More recent WAL records will contain *ALL* updates of the
> > > transaction.
> > > 6) The transaction may exist on the supplier at this checkpoint.
> > > Thanks to p.1 we must skip at most one operation. Jumping back through
> > > the supplier's checkpoint backpointer is guaranteed to do this.
> > >
> > > Igor, do we have the same understanding here?
> > >
> > > Vladimir.
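A sketch of the step 7 handshake under the "closest checkpoint in the past"
reading from the "why it works" list: the demander ships the backpointer of
its latest checkpoint, and the supplier walks its checkpoints from newest to
oldest until one is not ahead of that backpointer, then starts WAL iteration
from that checkpoint's own backpointer. It reuses the CheckpointMark shape
from the sketch above; names are hypothetical, not the actual rebalance API.

    import java.util.List;

    class HistoricalRebalanceSketch {
        /** Demander side (7.1): the counter to put into the demand message. */
        static long demandBackpointer(List<CheckpointMark> checkpoints) {
            // Checkpoints are ordered oldest -> newest; take the latest one.
            return checkpoints.get(checkpoints.size() - 1).backpointer;
        }

        /** Supplier side (7.2/7.3): the update counter from which WAL iteration starts. */
        static long rebalanceStartCounter(List<CheckpointMark> checkpoints, long demanderBackpointer) {
            // Walk newest -> oldest: the first checkpoint whose current counter
            // is not ahead of the demander's backpointer is the one to use.
            for (int i = checkpoints.size() - 1; i >= 0; i--) {
                CheckpointMark cp = checkpoints.get(i);
                if (cp.current <= demanderBackpointer)
                    return cp.backpointer; // resolve to a WAL record and iterate from there
            }
            return -1; // no suitable checkpoint: fall back to full rebalance
        }
    }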
> > >
> > > On Thu, Nov 29, 2018 at 2:47 PM Seliverstov Igor wrote:
> > >
> > > > Ivan,
> > > >
> > > > Different transactions may be applied in a different order on backup
> > > > nodes. That's why we need an active tx set and some sorting by their
> > > > update times. The idea is to identify a point in time starting from
> > > > which we may lose some updates.
> > > >
> > > > This point:
> > > > 1) is the last update on the timeline acknowledged by all backups
> > > > (including a possible further demander);
> > > > 2) has a specific update counter (aka back-counter) which we are
> > > > going to start the iteration from.
> > > >
> > > > After some additional thinking, I've identified a rule.
> > > >
> > > > There are two fences:
> > > > 1) update counter (UC) - this means that all updates with a smaller
> > > > UC than the applied one have been applied on a node having this UC;
> > > > 2) update in the scope of a TX - all updates are applied one by one,
> > > > sequentially, which means that the fact of an update guarantees that
> > > > the previous update (statement) was finished on all TX participants.
> > > >
> > > > Combining them, we can say the following:
> > > >
> > > > All updates that were acknowledged at the time the last update of
> > > > the tx which updated the UC was applied are guaranteed to be present
> > > > on a node having such a UC.
> > > >
> > > > We can use this rule to find the iterator start pointer.
> > > >
> > > > Wed, 28 Nov 2018 at 20:26, Pavlukhin Ivan:
> > > > >
> > > > > Guys,
> > > > >
> > > > > One more idea. We can introduce an additional update counter which
> > > > > is incremented by MVCC transactions right after executing an
> > > > > operation (like is done for classic transactions). And we can use
> > > > > that counter for searching for the needed WAL records. Can it do
> > > > > the trick?
> > > > >
> > > > > P.S. Mentally I am trying to separate the facilities providing
> > > > > transactions and durability. And it seems to me that those
> > > > > facilities are in different dimensions.
> > > > >
> > > > > Wed, 28 Nov 2018 at 16:26, Pavlukhin Ivan:
> > > > > >
> > > > > > Sorry, if it was stated that a SINGLE transaction's updates are
> > > > > > applied in the same order on all replicas, then I have no
> > > > > > questions so far. I thought about reordering of updates coming
> > > > > > from different transactions.
> > > > > > > I have not got why we can assume that reordering is not
> > > > > > > possible. What have I missed?
> > > > > >
> > > > > > Wed, 28 Nov 2018 at 13:26, Pavlukhin Ivan:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Regarding Vladimir's new idea:
> > > > > > > > We assume that transaction can be represented as a set of
> > > > > > > > independent operations, which are applied in the same order
> > > > > > > > on both primary and backup nodes.
> > > > > > > I have not got why we can assume that reordering is not
> > > > > > > possible. What have I missed?
> > > > > > >
> > > > > > > Tue, 27 Nov 2018 at 14:42, Seliverstov Igor <gvvinblade@gmail.com>:
> > > > > > > >
> > > > > > > > Vladimir,
> > > > > > > >
> > > > > > > > I think I got your point.
> > > > > > > >
> > > > > > > > It should work if we do the following: introduce two
> > > > > > > > structures, an active list (txs) and a candidate list
> > > > > > > > (updCntr -> txn pairs).
> > > > > > > >
> > > > > > > > Track active txs, mapping them to the actual update counter
> > > > > > > > at update time.
> > > > > > > > On each next update, put the update counter associated with
> > > > > > > > the previous update into the candidate list, possibly
> > > > > > > > overwriting the existing value (checking the txn).
> > > > > > > > On tx finish, remove the tx from the active list only if the
> > > > > > > > appropriate update counter (associated with the finished tx)
> > > > > > > > has been applied.
> > > > > > > > On an update counter update, set the minimal update counter
> > > > > > > > from the candidate list as the back-counter, clear the
> > > > > > > > candidate list and remove the associated tx from the active
> > > > > > > > list if present.
> > > > > > > > Use the back-counter instead of the actual update counter in
> > > > > > > > the demand message.
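One possible reading of the active list / candidate list rules above, as a
hedged sketch (all names are hypothetical, and the tx-finish rule is left
out for brevity): each update records the counter of the tx's previous
update as a candidate, and whenever the partition counter advances the
minimal candidate becomes the back-counter that goes into the demand
message.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    class BackCounterTracker {
        /** Active list: txId -> update counter observed at the tx's latest update. */
        private final Map<Long, Long> activeTxs = new HashMap<>();

        /** Candidate list: counter of a tx's previous update -> that txId. */
        private final TreeMap<Long, Long> candidates = new TreeMap<>();

        private long backCounter;

        /** Called on every tx update; cntr is the update counter at update time. */
        synchronized void onTxUpdate(long txId, long cntr) {
            Long prev = activeTxs.put(txId, cntr);
            if (prev != null)
                candidates.put(prev, txId); // counter associated with the previous update
        }

        /** Called when the partition update counter is advanced. */
        synchronized void onCounterAdvance() {
            if (!candidates.isEmpty()) {
                Map.Entry<Long, Long> min = candidates.firstEntry();
                backCounter = min.getKey();       // minimal candidate becomes the back-counter
                activeTxs.remove(min.getValue()); // drop the associated tx if still present
                candidates.clear();
            }
        }

        /** Value to use instead of the actual update counter in the demand message. */
        synchronized long backCounter() {
            return backCounter;
        }
    }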
> > > > > > > >
> > > > > > > > Tue, 27 Nov 2018 at 12:56, Seliverstov Igor <gvvinblade@gmail.com>:
> > > > > > > > >
> > > > > > > > > Ivan,
> > > > > > > > >
> > > > > > > > > 1) The list is saved on each checkpoint, wholly (all
> > > > > > > > > transactions in the active state at checkpoint begin).
> > > > > > > > > We need the whole list to get the oldest transaction,
> > > > > > > > > because after the previous oldest tx finishes, we need to
> > > > > > > > > get the following one.
> > > > > > > > >
> > > > > > > > > 2) I guess there is a description of how the persistent
> > > > > > > > > storage works and how it restores [1].
> > > > > > > > >
> > > > > > > > > Vladimir,
> > > > > > > > >
> > > > > > > > > The whole list of what we are going to store on checkpoint
> > > > > > > > > (updated):
> > > > > > > > > 1) Partition counter low watermark (LWM)
> > > > > > > > > 2) WAL pointer of the earliest active transaction write to
> > > > > > > > > the partition at the time the checkpoint started
> > > > > > > > > 3) List of prepared txs with acquired partition counters
> > > > > > > > > (which were acquired but not applied yet)
> > > > > > > > >
> > > > > > > > > This way we don't need any additional info in the demand
> > > > > > > > > message. The start point can be easily determined using
> > > > > > > > > the stored WAL "back-pointer".
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess
> > > > > > > > >
> > > > > > > > > Tue, 27 Nov 2018 at 11:19, Vladimir Ozerov <vozerov@gridgain.com>:
> > > > > > > > >
> > > > > > > > > > Igor,
> > > > > > > > > >
> > > > > > > > > > Could you please elaborate - what is the whole set of
> > > > > > > > > > information we are going to save at checkpoint time?
> > > > > > > > > > From what I understand it should be:
> > > > > > > > > > 1) List of active transactions with WAL pointers of
> > > > > > > > > > their first writes
> > > > > > > > > > 2) List of prepared transactions with their update
> > > > > > > > > > counters
> > > > > > > > > > 3) Partition counter low watermark (LWM) - the smallest
> > > > > > > > > > partition counter before which there are no prepared
> > > > > > > > > > transactions.
> > > > > > > > > >
> > > > > > > > > > And then we send the supplier node a message: "Give me
> > > > > > > > > > all updates starting from that LWM plus data for those
> > > > > > > > > > transactions which were active when I failed".
> > > > > > > > > >
> > > > > > > > > > Am I right?
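A sketch of the checkpoint payload and the demand message being discussed,
written as plain value classes; these are hypothetical shapes for
illustration, not the real Ignite message formats.

    import java.util.List;

    /** Per-partition transactional state persisted with a checkpoint. */
    class CheckpointTxState {
        final long partitionCounterLwm;        // 1) partition counter low watermark (LWM)
        final long earliestActiveTxWalPointer; // 2) WAL pointer of the earliest active tx write
        final List<PreparedTx> preparedTxs;    // 3) prepared txs with acquired, not yet applied counters

        CheckpointTxState(long lwm, long walPtr, List<PreparedTx> preparedTxs) {
            this.partitionCounterLwm = lwm;
            this.earliestActiveTxWalPointer = walPtr;
            this.preparedTxs = preparedTxs;
        }
    }

    class PreparedTx {
        final long txId;
        final long acquiredCounter;

        PreparedTx(long txId, long acquiredCounter) {
            this.txId = txId;
            this.acquiredCounter = acquiredCounter;
        }
    }

    /** "Give me all updates starting from that LWM plus data for the txs that were active when I failed." */
    class HistoricalDemandMessage {
        final long lwm;
        final List<Long> activeTxIds;

        HistoricalDemandMessage(long lwm, List<Long> activeTxIds) {
            this.lwm = lwm;
            this.activeTxIds = activeTxIds;
        }
    }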
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor <gvvinblade@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Igniters,
> > > > > > > > > > >
> > > > > > > > > > > Currently I'm working on possible approaches for how
> > > > > > > > > > > to implement historical rebalance (delta rebalance
> > > > > > > > > > > using a WAL iterator) over MVCC caches.
> > > > > > > > > > >
> > > > > > > > > > > The main difficulty is that MVCC writes changes during
> > > > > > > > > > > the tx active phase, while the partition update
> > > > > > > > > > > version, aka update counter, is applied on tx finish.
> > > > > > > > > > > This means we cannot start iteration over the WAL
> > > > > > > > > > > right from the pointer where the update counter was
> > > > > > > > > > > updated, but should include the updates which the
> > > > > > > > > > > transaction that updated the counter did.
> > > > > > > > > > >
> > > > > > > > > > > These updates may be much earlier than the point where
> > > > > > > > > > > the update counter was updated, so we have to be able
> > > > > > > > > > > to identify the point where the first update happened.
> > > > > > > > > > >
> > > > > > > > > > > The proposed approach includes:
> > > > > > > > > > >
> > > > > > > > > > > 1) preserve a list of active txs, sorted by the time
> > > > > > > > > > > of their first update (using the WAL ptr of the first
> > > > > > > > > > > WAL record in the tx)
> > > > > > > > > > >
> > > > > > > > > > > 2) persist this list on each checkpoint (together with
> > > > > > > > > > > the TxLog, for example)
> > > > > > > > > > >
> > > > > > > > > > > 3) send the whole active tx list (transactions which
> > > > > > > > > > > were in the active state at the time the node crashed,
> > > > > > > > > > > or an empty list in case of a graceful node stop) as a
> > > > > > > > > > > part of the partition demand message
> > > > > > > > > > >
> > > > > > > > > > > 4) find a checkpoint where the earliest tx exists in
> > > > > > > > > > > the persisted txs and use the saved WAL ptr as a start
> > > > > > > > > > > point, or apply the current approach in case the
> > > > > > > > > > > active tx list (sent on the previous step) is empty
> > > > > > > > > > >
> > > > > > > > > > > 5) start iteration.
> > > > > > > > > > >
> > > > > > > > > > > Your thoughts?
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Igor

--
Best regards,
Ivan Pavlukhin
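For completeness, a loose sketch of step 4 of the original proposal quoted
above: on the supplier, pick the checkpoint whose persisted active-tx list
contains the earliest transaction from the demand message and start WAL
iteration from that transaction's saved first-write pointer. Types and the
fallback convention are hypothetical.

    import java.util.List;
    import java.util.Map;

    class StartPointResolver {
        /**
         * @param checkpointTxLists Per-checkpoint persisted active txs, newest checkpoint
         *                          first: txId -> WAL pointer of the tx's first record.
         * @param demandedActiveTxs Active txs from the demand message, sorted by the time
         *                          of their first update.
         * @return WAL pointer to start iteration from, or -1 to fall back to the current
         *         (update counter based) approach.
         */
        static long resolveStartPointer(List<Map<Long, Long>> checkpointTxLists,
                                        List<Long> demandedActiveTxs) {
            if (demandedActiveTxs.isEmpty())
                return -1; // graceful stop: empty list, use the current approach

            long earliestTx = demandedActiveTxs.get(0); // earliest first update

            for (Map<Long, Long> persistedTxs : checkpointTxLists) {
                Long firstWritePtr = persistedTxs.get(earliestTx);
                if (firstWritePtr != null)
                    return firstWritePtr; // start WAL iteration from the tx's first record
            }
            return -1;
        }
    }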