From: Vladimir Ozerov
Date: Fri, 15 Dec 2017 16:59:20 +0300
Subject: Re: Rework locking architecture for MVCC and transactional SQL
To: dev@ignite.apache.org

Well, there is a completely different approach to handle the problem - lock
escalation. E.g. SQL Server typically requires 96 bytes per lock - much
less than Ignite, but still. And as soon as some query requires more than
5000 row locks, it is escalated to an exclusive table lock. Not only does
this eliminate the memory consumption problem, it also increases the
performance of massive updates dramatically. On the other hand, this
approach is more prone to deadlocks, since transactions updating disjoint
data sets now have shared resources to be locked. There is no silver
bullet, apparently.
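For illustration, a minimal Java sketch of the escalation idea (single
table, single transaction; RowLockTable and ESCALATION_THRESHOLD are
made-up names, not the actual SQL Server or Ignite machinery):

import java.util.HashSet;
import java.util.Set;

class RowLockTable {
    // Roughly mirrors the 5000-lock threshold mentioned above.
    private static final int ESCALATION_THRESHOLD = 5000;

    private final Set<Long> lockedRowIds = new HashSet<>();
    private boolean tableLocked;

    // Locks a single row, escalating to an exclusive table lock
    // once the per-row bookkeeping grows past the threshold.
    synchronized void lockRow(long rowId) {
        if (tableLocked)
            return; // The table lock already covers every row.

        lockedRowIds.add(rowId);

        if (lockedRowIds.size() > ESCALATION_THRESHOLD) {
            // Swap thousands of row locks for one table lock: memory
            // drops to O(1), but updates of disjoint rows now conflict,
            // which is exactly the deadlock trade-off described above.
            lockedRowIds.clear();
            tableLocked = true;
        }
    }
}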
On Fri, Dec 15, 2017 at 4:42 PM, Vladimir Ozerov wrote:

> Alex,
>
> That might be a very good idea. In fact, what you describe closely
> resembles TempDB in MS SQL Server [1]. It is also offloaded to disk,
> minimally logged and purged on restart. Ideally we could try designing
> this component in a generic way, so that it could store a lot of
> different temporal stuff:
> 1) Locks
> 2) UNDO data
> 3) Sort/join data (for SELECT and CREATE INDEX, statistics, whatsoever)
> 4) If needed - visibility info (e.g. for covering indexes and purge/vacuum)
>
> WDYT?
>
> Vladimir.
>
> [1] https://docs.microsoft.com/en-us/sql/relational-databases/databases/tempdb-database
>
> On Fri, Dec 15, 2017 at 4:26 PM, Alexey Goncharuk <
> alexey.goncharuk@gmail.com> wrote:
>
>> Vladimir,
>>
>> What about moving the entire locking mechanism to a separate off-heap
>> memory region which will be volatile wrt restarts, but will still
>> support off-load to disk? In the current architecture, it means that
>> we will need to allocate a separate DataRegion with no WAL and no
>> crash recovery - locks are meaningless after a restart, and we will
>> automatically drop them. It would be interesting to prototype this
>> because I think we may be on par with on-heap lock placement, as we
>> already proved for in-memory caches.
>>
>> 2017-12-14 21:53 GMT+03:00 Denis Magda:
>>
>>> Vladimir,
>>>
>>> No, it's crystal clear, thanks.
>>>
>>> If this approach works only for Ignite persistence based deployments,
>>> how will we handle locking for pure in-memory and caching of 3rd
>>> party databases scenarios? As I understand, the tuples will still be
>>> stored in the page memory, while there won't be any opportunity to
>>> fall back to disk if the memory usage exceeds some threshold.
>>>
>>> --
>>> Denis
>>>
>>>> On Dec 13, 2017, at 11:21 PM, Vladimir Ozerov wrote:
>>>>
>>>> Denis,
>>>>
>>>> Sorry, maybe I was not clear enough - the "tuple approach" and the
>>>> "persistent approach" are the same. By "tuple" I mean a row stored
>>>> inside a data block. Currently we store lock information in the
>>>> Java heap, and the proposal is to move it to data blocks. The main
>>>> driver is memory - if there are a lot of rows to be locked, we will
>>>> either run out of memory or produce serious memory pressure. For
>>>> example, currently an update of 1M entries will consume ~500Mb of
>>>> heap. With the proposed approach it will consume almost nothing.
>>>> The drawback is an increased number of dirty data pages, but it
>>>> should not be a problem because in the final implementation we will
>>>> update data rows before the prepare phase anyway, so I do not
>>>> expect any write amplification in the usual case.
>>>>
>>>> This approach is only applicable to Ignite persistence.
>>>>
>>>> On Thu, Dec 14, 2017 at 1:53 AM, Denis Magda wrote:
>>>>
>>>>> Vladimir,
>>>>>
>>>>> Thanks for a thorough overview and proposal.
>>>>>
>>>>>> Also we could try employing a tiered approach:
>>>>>> 1) Try to keep everything in-memory to minimize writes to blocks
>>>>>> 2) Fall back to persistent lock data if a certain threshold is
>>>>>> reached.
>>>>>
>>>>> What are the benefits of the backed-by-persistence approach
>>>>> compared to the one based on tuples? Specifically:
>>>>> - will the persistence approach work for both 3rd party and Ignite
>>>>> persistence?
>>>>> - any performance impacts depending on the chosen method?
>>>>> - what's faster to implement?
>>>>>
>>>>> --
>>>>> Denis
>>>>>
>>>>>> On Dec 13, 2017, at 2:10 AM, Vladimir Ozerov wrote:
>>>>>>
>>>>>> Igniters,
>>>>>>
>>>>>> As you probably know, we are actively working on the MVCC [1] and
>>>>>> transactional SQL [2] features, which could be treated as a
>>>>>> single huge improvement. We face a number of challenges, and one
>>>>>> of them is locking.
>>>>>>
>>>>>> At the moment, information about all locks is kept in memory on a
>>>>>> per-entry basis (see GridCacheMvccManager). For every locked key
>>>>>> we maintain the current lock owner (XID) and the list of
>>>>>> would-be-owner transactions. When a transaction is about to lock
>>>>>> an entry, two scenarios are possible:
>>>>>> 1) If the entry is not locked, we obtain the lock immediately.
>>>>>> 2) If the entry is locked, we add the current transaction to the
>>>>>> wait list and jump to the next entry to be locked. Once the first
>>>>>> entry is released by the conflicting transaction, the current
>>>>>> transaction becomes the owner of the first entry and tries to
>>>>>> promote itself for subsequent entries.
>>>>>>
>>>>>> Once all required locks are obtained, a response is sent to the
>>>>>> caller.
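A toy Java model of that per-entry scheme, for reference while reading the
proposal below (class and method names are illustrative, not the actual
GridCacheMvccManager code, which is far more involved):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class EntryLockManager {
    // One candidate queue per locked key; the queue head is the owner.
    // This map lives in the Java heap, which is exactly why locking
    // millions of rows is so expensive in the current architecture.
    private final Map<Object, Deque<UUID>> locks = new ConcurrentHashMap<>();

    // Returns true if xid became the owner, false if it was enqueued.
    boolean lock(Object key, UUID xid) {
        Deque<UUID> q = locks.computeIfAbsent(key, k -> new ArrayDeque<>());
        synchronized (q) {
            q.addLast(xid);
            return q.peekFirst().equals(xid);
        }
    }

    // Releases the lock and returns the next waiter to promote, if any.
    // (Empty queues are deliberately never removed in this sketch.)
    UUID unlock(Object key, UUID xid) {
        Deque<UUID> q = locks.get(key);
        if (q == null)
            return null;
        synchronized (q) {
            q.remove(xid);
            return q.peekFirst();
        }
    }
}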
>>>>>> This approach doesn't work well for transactional SQL - if we
>>>>>> update millions of rows in a single transaction, we will simply
>>>>>> run out of memory. To mitigate the problem, other database
>>>>>> vendors keep information about locks inside the tuples. I propose
>>>>>> to apply a similar design, as follows:
>>>>>>
>>>>>> 1) No per-entry lock information is stored in memory anymore.
>>>>>> 2) The list of active transactions is still maintained in memory.
>>>>>> 3) When a TX locks an entry, it sets a special marker in the
>>>>>> tuple [3] (see the sketch below).
>>>>>> 4) When a TX meets an already locked entry, it enlists itself in
>>>>>> the wait queue of the conflicting transaction and suspends.
>>>>>> 5) When the first transaction releases the conflicting lock, it
>>>>>> notifies and wakes up the suspended transactions, so they resume
>>>>>> locking.
>>>>>> 6) Entry lock data is cleared on transaction commit.
>>>>>> 7) Entry lock data is not cleared on rollback or node restart;
>>>>>> instead, we could use the active transactions list to identify
>>>>>> invalid locks and overwrite them as needed.
>>>>>>
>>>>>> Also we could try employing a tiered approach:
>>>>>> 1) Try to keep everything in-memory to minimize writes to blocks
>>>>>> 2) Fall back to persistent lock data if a certain threshold is
>>>>>> reached.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-3478
>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-4191
>>>>>> [3] Depends on the final MVCC design - it could be a per-tuple
>>>>>> XID, undo vectors, per-block transaction lists, etc.
>>>>>>
>>>>>> Vladimir.
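A rough Java sketch of points 3)-5) of the proposal above, under
simplifying assumptions: the lock marker is modeled as a plain XID field
in an in-heap RowHeader stand-in (in the real design it would live in the
data-block row itself), and waiters park on the row monitor instead of
enlisting in the conflicting transaction's wait queue. All names here are
hypothetical:

import java.util.UUID;

class TupleLocking {
    // Stands in for the lock marker inside a data-block row header [3].
    static class RowHeader {
        UUID lockXid; // null => unlocked; guarded by the row monitor.
    }

    // Stands in for an entry in the in-memory active transactions list.
    static class Tx {
        final UUID xid = UUID.randomUUID();

        // 3) Set the marker; 4) on conflict, suspend until released.
        void lock(RowHeader row) throws InterruptedException {
            synchronized (row) {
                while (row.lockXid != null && !row.lockXid.equals(xid))
                    row.wait();
                row.lockXid = xid; // The marker written to the tuple.
            }
        }

        // 5)/6) Clear the marker on commit and wake suspended waiters.
        void unlock(RowHeader row) {
            synchronized (row) {
                if (xid.equals(row.lockXid)) {
                    row.lockXid = null;
                    row.notifyAll();
                }
            }
        }
    }
}

In the real design, clearing the marker on commit (point 6) dirties the
data page, and recovery after a restart would rely on the active
transactions list from point 2) rather than on eager cleanup (point 7).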