From: Vladimir Ozerov
Date: Fri, 15 Dec 2017 16:59:20 +0300
Subject: Re: Rework locking architecture for MVCC and transactional SQL
To: dev@ignite.apache.org

Well, there is a completely different approach to handle the problem - lock
escalation. E.g. SQL Server typically requires 96 bytes per lock - much
less than Ignite, but still. And as soon as some query requires more than
5000 row locks, it is escalated to an exclusive table lock. Not only does
this eliminate the memory consumption problem, it also increases the
performance of massive updates dramatically. On the other hand, this
approach is more prone to deadlocks, since transactions updating disjoint
data sets now have shared resources to be locked. There is no silver
bullet, apparently.
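For illustration, a minimal Java sketch of the escalation idea (single
table, single transaction; RowLockTable and ESCALATION_THRESHOLD are
made-up names, not the actual SQL Server or Ignite machinery):

import java.util.HashSet;
import java.util.Set;

class RowLockTable {
    // Roughly mirrors the 5000-lock threshold mentioned above.
    private static final int ESCALATION_THRESHOLD = 5000;

    private final Set<Long> lockedRowIds = new HashSet<>();
    private boolean tableLocked;

    // Locks a single row, escalating to an exclusive table lock
    // once the per-row bookkeeping grows past the threshold.
    synchronized void lockRow(long rowId) {
        if (tableLocked)
            return; // The table lock already covers every row.

        lockedRowIds.add(rowId);

        if (lockedRowIds.size() > ESCALATION_THRESHOLD) {
            // Swap thousands of row locks for one table lock: memory
            // drops to O(1), but updates of disjoint rows now conflict,
            // which is exactly the deadlock trade-off described above.
            lockedRowIds.clear();
            tableLocked = true;
        }
    }
}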
On Fri, Dec 15, 2017 at 4:42 PM, Vladimir Ozerov wrote:

> Alex,
>
> That might be a very good idea. In fact, what you describe closely
> resembles TempDB in MS SQL Server [1]. It is also offloaded to disk,
> minimally logged and purged on restart. Ideally we could try designing
> this component in a generic way, so that it could store a lot of
> different temporal stuff:
> 1) Locks
> 2) UNDO data
> 3) Sort/join data (for SELECT and CREATE INDEX, statistics, whatsoever)
> 4) If needed - visibility info (e.g. for covering indexes and purge/vacuum)
>
> WDYT?
>
> Vladimir.
>
> [1] https://docs.microsoft.com/en-us/sql/relational-databases/databases/tempdb-database
>
> On Fri, Dec 15, 2017 at 4:26 PM, Alexey Goncharuk <
> alexey.goncharuk@gmail.com> wrote:
>
>> Vladimir,
>>
>> What about moving the entire locking mechanism to a separate off-heap
>> memory region which will be volatile wrt restarts, but will still
>> support off-load to disk? In the current architecture, it means that
>> we will need to allocate a separate DataRegion with no WAL and no
>> crash recovery - locks are meaningless after a restart, and we will
>> automatically drop them. It would be interesting to prototype this
>> because I think we may be on par with on-heap lock placement, as we
>> already proved for in-memory caches.
>>
>> 2017-12-14 21:53 GMT+03:00 Denis Magda:
>>
>>> Vladimir,
>>>
>>> No, it's crystal clear, thanks.
>>>
>>> If this approach works only for Ignite persistence based deployments,
>>> how will we handle locking for pure in-memory and caching of 3rd
>>> party databases scenarios? As I understand, the tuples will still be
>>> stored in the page memory, while there won't be any opportunity to
>>> fall back to disk if the memory usage exceeds some threshold.
>>>
>>> --
>>> Denis
>>>
>>>> On Dec 13, 2017, at 11:21 PM, Vladimir Ozerov wrote:
>>>>
>>>> Denis,
>>>>
>>>> Sorry, maybe I was not clear enough - the "tuple approach" and the
>>>> "persistent approach" are the same. By "tuple" I mean a row stored
>>>> inside a data block. Currently we store lock information in the
>>>> Java heap, and the proposal is to move it to data blocks. The main
>>>> driver is memory - if there are a lot of rows to be locked, we will
>>>> either run out of memory or produce serious memory pressure. For
>>>> example, currently an update of 1M entries will consume ~500Mb of
>>>> heap. With the proposed approach it will consume almost nothing.
>>>> The drawback is an increased number of dirty data pages, but it
>>>> should not be a problem because in the final implementation we will
>>>> update data rows before the prepare phase anyway, so I do not
>>>> expect any write amplification in the usual case.
>>>>
>>>> This approach is only applicable to Ignite persistence.
>>>>
>>>> On Thu, Dec 14, 2017 at 1:53 AM, Denis Magda wrote:
>>>>
>>>>> Vladimir,
>>>>>
>>>>> Thanks for a thorough overview and proposal.
>>>>>
>>>>>> Also we could try employing a tiered approach:
>>>>>> 1) Try to keep everything in-memory to minimize writes to blocks
>>>>>> 2) Fall back to persistent lock data if a certain threshold is
>>>>>> reached.
>>>>>
>>>>> What are the benefits of the backed-by-persistence approach
>>>>> compared to the one based on tuples? Specifically:
>>>>> - will the persistence approach work for both 3rd party and Ignite
>>>>> persistence?
>>>>> - any performance impacts depending on the chosen method?
>>>>> - what's faster to implement?
>>>>>
>>>>> --
>>>>> Denis
>>>>>
>>>>>> On Dec 13, 2017, at 2:10 AM, Vladimir Ozerov wrote:
>>>>>>
>>>>>> Igniters,
>>>>>>
>>>>>> As you probably know, we are actively working on the MVCC [1] and
>>>>>> transactional SQL [2] features, which could be treated as a
>>>>>> single huge improvement. We face a number of challenges, and one
>>>>>> of them is locking.
>>>>>>
>>>>>> At the moment, information about all locks is kept in memory on a
>>>>>> per-entry basis (see GridCacheMvccManager). For every locked key
>>>>>> we maintain the current lock owner (XID) and the list of
>>>>>> would-be-owner transactions. When a transaction is about to lock
>>>>>> an entry, two scenarios are possible:
>>>>>> 1) If the entry is not locked, we obtain the lock immediately.
>>>>>> 2) If the entry is locked, we add the current transaction to the
>>>>>> wait list and jump to the next entry to be locked. Once the first
>>>>>> entry is released by the conflicting transaction, the current
>>>>>> transaction becomes the owner of the first entry and tries to
>>>>>> promote itself for subsequent entries.
>>>>>>
>>>>>> Once all required locks are obtained, a response is sent to the
>>>>>> caller.
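A toy Java model of that per-entry scheme, for reference while reading the
proposal below (class and method names are illustrative, not the actual
GridCacheMvccManager code, which is far more involved):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class EntryLockManager {
    // One candidate queue per locked key; the queue head is the owner.
    // This map lives in the Java heap, which is exactly why locking
    // millions of rows is so expensive in the current architecture.
    private final Map<Object, Deque<UUID>> locks = new ConcurrentHashMap<>();

    // Returns true if xid became the owner, false if it was enqueued.
    boolean lock(Object key, UUID xid) {
        Deque<UUID> q = locks.computeIfAbsent(key, k -> new ArrayDeque<>());
        synchronized (q) {
            q.addLast(xid);
            return q.peekFirst().equals(xid);
        }
    }

    // Releases the lock and returns the next waiter to promote, if any.
    // (Empty queues are deliberately never removed in this sketch.)
    UUID unlock(Object key, UUID xid) {
        Deque<UUID> q = locks.get(key);
        if (q == null)
            return null;
        synchronized (q) {
            q.remove(xid);
            return q.peekFirst();
        }
    }
}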
>>>>>> This approach doesn't work well for transactional SQL - if we
>>>>>> update millions of rows in a single transaction, we will simply
>>>>>> run out of memory. To mitigate the problem, other database
>>>>>> vendors keep information about locks inside the tuples. I propose
>>>>>> to apply a similar design, as follows:
>>>>>>
>>>>>> 1) No per-entry lock information is stored in memory anymore.
>>>>>> 2) The list of active transactions is still maintained in memory.
>>>>>> 3) When a TX locks an entry, it sets a special marker in the
>>>>>> tuple [3] (see the sketch below).
>>>>>> 4) When a TX meets an already locked entry, it enlists itself in
>>>>>> the wait queue of the conflicting transaction and suspends.
>>>>>> 5) When the first transaction releases the conflicting lock, it
>>>>>> notifies and wakes up the suspended transactions, so they resume
>>>>>> locking.
>>>>>> 6) Entry lock data is cleared on transaction commit.
>>>>>> 7) Entry lock data is not cleared on rollback or node restart;
>>>>>> instead, we could use the active transactions list to identify
>>>>>> invalid locks and overwrite them as needed.
>>>>>>
>>>>>> Also we could try employing a tiered approach:
>>>>>> 1) Try to keep everything in-memory to minimize writes to blocks
>>>>>> 2) Fall back to persistent lock data if a certain threshold is
>>>>>> reached.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-3478
>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-4191
>>>>>> [3] Depends on the final MVCC design - it could be a per-tuple
>>>>>> XID, undo vectors, per-block transaction lists, etc.
>>>>>>
>>>>>> Vladimir.
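A rough Java sketch of points 3)-5) of the proposal above, under
simplifying assumptions: the lock marker is modeled as a plain XID field
in an in-heap RowHeader stand-in (in the real design it would live in the
data-block row itself), and waiters park on the row monitor instead of
enlisting in the conflicting transaction's wait queue. All names here are
hypothetical:

import java.util.UUID;

class TupleLocking {
    // Stands in for the lock marker inside a data-block row header [3].
    static class RowHeader {
        UUID lockXid; // null => unlocked; guarded by the row monitor.
    }

    // Stands in for an entry in the in-memory active transactions list.
    static class Tx {
        final UUID xid = UUID.randomUUID();

        // 3) Set the marker; 4) on conflict, suspend until released.
        void lock(RowHeader row) throws InterruptedException {
            synchronized (row) {
                while (row.lockXid != null && !row.lockXid.equals(xid))
                    row.wait();
                row.lockXid = xid; // The marker written to the tuple.
            }
        }

        // 5)/6) Clear the marker on commit and wake suspended waiters.
        void unlock(RowHeader row) {
            synchronized (row) {
                if (xid.equals(row.lockXid)) {
                    row.lockXid = null;
                    row.notifyAll();
                }
            }
        }
    }
}

In the real design, clearing the marker on commit (point 6) dirties the
data page, and recovery after a restart would rely on the active
transactions list from point 2) rather than on eager cleanup (point 7).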