From issues-return-92365-archive-asf-public=cust-asf.ponee.io@ignite.apache.org Wed Feb 27 12:09:13 2019
Date: Wed, 27 Feb 2019 12:03:00 +0000 (UTC)
From: "Igor Seliverstov (JIRA)"
To: issues@ignite.apache.org
Reply-To: dev@ignite.apache.org
Subject: [jira] [Updated] (IGNITE-11433) MVCC: Link entry versions at the Data Store layer.

     [ https://issues.apache.org/jira/browse/IGNITE-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Igor Seliverstov updated IGNITE-11433:
--------------------------------------
    Description: 
Currently all tuple versions are placed inside index trees. CacheDataTree is used to link versions to each other (using their order inside an index page).

Although this approach is easy to implement and was the preferable starting point, it brings several disadvantages:

1) At update time we need to iterate over tuple versions under a read (or even write) lock on an index page, which blocks other write (read) operations for a relatively long time.
2) We index all tuple versions, which makes indexes use much more space than needed.
3) We cannot implement several important improvements (e.g. data streamer optimizations), because having several versions of one key in an index page doesn't allow using Invoke operations.
4) Write amplification hurts not only the Data Store layer but the indexes as well, which makes read/lookup operations on indexes much slower.

Linking versions at the Data Store layer only (as other vendors do) solves these issues or reduces their impact.

So, the proposed changes are:

1) Change the data page layout by adding two fields to its header: {{link}} (a link to the next tuple in a version chain) and {{lock}} (the tx which holds a write lock on the HEAD of the chain); a header-layout sketch is given after this list. There are several possible optimizations here: a) leave the lock as is (in the cache index item); b) use the max version as the lock version as well.
2) Do not store all versions of a tuple in indexes; this means removing the version from the key, so the newest version overwrites the existing entry.
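To make proposed change 1) more concrete, here is a minimal, self-contained sketch of a per-row header carrying the {{link}} and {{lock}} fields. The class name and offsets ({{RowHeaderSketch}}, {{ROW_LINK_OFF}}, {{ROW_LOCK_OFF}}) are hypothetical and operate over a plain {{ByteBuffer}}; this is not Ignite's actual page format or internal API, only an illustration of keeping the version chain and the lock owner on the data page instead of in the index.

{code:java}
import java.nio.ByteBuffer;

/**
 * Illustrative only: a hypothetical per-row header on a data page.
 * Offsets and sizes are made up for the sketch, not taken from Ignite.
 */
final class RowHeaderSketch {
    /** 8 bytes: link to the next (older) tuple version in the chain, 0 if none. */
    private static final int ROW_LINK_OFF = 0;

    /** 8 bytes: id of the tx holding the write lock on the chain HEAD, 0 if unlocked. */
    private static final int ROW_LOCK_OFF = 8;

    static final int HEADER_SIZE = 16;

    /** Stands in for an off-heap data page. */
    private final ByteBuffer page;

    RowHeaderSketch(ByteBuffer page) {
        this.page = page;
    }

    long nextVersionLink(int rowOff) {
        return page.getLong(rowOff + ROW_LINK_OFF);
    }

    /** Called on update: the freshly written version points to the previous chain head. */
    void setNextVersionLink(int rowOff, long olderVersionLink) {
        page.putLong(rowOff + ROW_LINK_OFF, olderVersionLink);
    }

    long lockOwner(int rowOff) {
        return page.getLong(rowOff + ROW_LOCK_OFF);
    }

    void lock(int rowOff, long txId) {
        page.putLong(rowOff + ROW_LOCK_OFF, txId);
    }

    void unlock(int rowOff) {
        page.putLong(rowOff + ROW_LOCK_OFF, 0L);
    }
}
{code}

With optimization b) the separate {{lock}} slot would disappear and the row's max version would double as the lock marker, shrinking this hypothetical header to a single 8-byte field.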
There are two approaches to linking versions, each with its pros and cons:

1) N2O (newer to older) - a reader (writer) gets the newest tuple version first and iterates from newer to older versions until it reaches one whose min and max versions bracket its snapshot (a visibility-check sketch is given below, after this list). This approach implies faster reads (the most recent versions are reached first) but requires updating all involved indexes on each write operation - slower writes, in other words (this may be mitigated by using logical pointers to the head of the version chain). Cooperative VAC (update operations removing tuple versions that are no longer visible to any reader) is possible.
2) O2N (older to newer) - a reader starts from the oldest tuple version and iterates towards newer ones until it reaches a version visible to its snapshot. This allows not updating all indexes on each write (except when an indexed value changes), so write operations become lighter. Cooperative VAC is almost impossible.

We need to decide which approach to use depending on which load profile is preferable (OLTP or OLAP).
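The following is a minimal sketch of the N2O read path described in item 1) above, under the simplifying assumption that a snapshot is a single counter and that every version carries a min version (the tx that created it) and a max version (the tx that overwrote it, or 0 for the chain head). The types ({{VersionChainSketch}}, {{Row}}, {{Snapshot}}) are hypothetical stand-ins, not Ignite's MVCC classes.

{code:java}
import java.util.Optional;

/** Illustrative only: N2O (newer-to-older) version chain traversal. */
final class VersionChainSketch {
    /** One tuple version; 'older' plays the role of the proposed 'link' field. */
    record Row(long minVersion, long maxVersion, Object value, Row older) {}

    /** A reader's snapshot, reduced to a single counter for the sketch. */
    record Snapshot(long counter) {
        /** The row is visible if it was created before the snapshot and not yet overwritten at snapshot time. */
        boolean sees(Row row) {
            return row.minVersion() <= counter
                && (row.maxVersion() == 0 || row.maxVersion() > counter);
        }
    }

    /**
     * N2O scan: start from the newest version (the chain head) and follow the
     * older links until the snapshot falls between a row's min and max versions.
     */
    static Optional<Object> read(Row head, Snapshot snap) {
        for (Row row = head; row != null; row = row.older()) {
            if (snap.sees(row))
                return Optional.of(row.value());
        }
        return Optional.empty(); // no version is visible to this snapshot
    }

    public static void main(String[] args) {
        // Each update becomes the new chain head and links the previous head as 'older'.
        Row v1 = new Row(10, 20, "value written by tx 10", null);
        Row v2 = new Row(20, 0, "value written by tx 20", v1);

        System.out.println(read(v2, new Snapshot(15))); // sees the older version (tx 10)
        System.out.println(read(v2, new Snapshot(25))); // sees the chain head (tx 20)
    }
}
{code}

The same {{Row}} data would support an O2N layout as well; only the direction of the link and the starting point of the scan change.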
> MVCC: Link entry versions at the Data Store layer.
> --------------------------------------------------
>
>         Key: IGNITE-11433
>         URL: https://issues.apache.org/jira/browse/IGNITE-11433
>     Project: Ignite
>  Issue Type: Improvement
>  Components: mvcc, sql
>    Reporter: Igor Seliverstov
>    Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)