Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Tue, 20 Jan 2015 19:22:35 +0000 (UTC)
From: "Russ Hatch (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12768748.1421781203000.126425.1421781755086@Atlassian.JIRA>
In-Reply-To: <JIRA.12768748.1421781203000@Atlassian.JIRA>
References: <JIRA.12768748.1421781203000@Atlassian.JIRA>
 <JIRA.12768748.1421781203256@arcas>
Subject: [jira] [Commented] (CASSANDRA-8654) Data validation test
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/CASSANDRA-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284249#comment-14284249 ] 

Russ Hatch commented on CASSANDRA-8654:
---------------------------------------

One notion I have explored is doing this from dtest using a simple log of row contents (on disk). My prototype uses the datahelp.py functionality in dtest to create data in C* and also maintains the log, which is used as the authority on what the DB rows should look like. I can expand on this idea further, but it does have some drawbacks in its present state (it would take some work to really make it useful).

This is incomplete, but in a very basic sense the dtest would look a bit like this: https://github.com/riptano/cassandra-dtest/blob/experimental_datatool/paging_test.py#L589
Create a log object of some kind, make a call to create a bunch of data, passing in the log so the data creation code can log expected DB state.
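
To make that flow concrete, here is a minimal sketch of the idea, not the prototype code itself. A plain dict stands in for the C* table, and the names ExpectedStateLog, create_rows and verify are hypothetical, not part of dtest or datahelp.py.

```python
# Hypothetical sketch: a dict stands in for the C* table; none of these
# names come from dtest or datahelp.py.

class ExpectedStateLog:
    """Records what each row should look like, keyed by primary key."""

    def __init__(self):
        self._rows = {}

    def record(self, pkey, row):
        # Copy the row so later mutation of the caller's dict cannot
        # silently change the expected state.
        self._rows[pkey] = dict(row)

    def expected_rows(self):
        return dict(self._rows)


def create_rows(db, log, count):
    """Write `count` rows to `db` and log the expected state of each."""
    for i in range(count):
        row = {"id": i, "value": "val-%d" % i}
        db[i] = row          # stand-in for an INSERT against the cluster
        log.record(i, row)   # the log is the authority on expected state


def verify(db, log):
    """Return the primary keys whose stored row differs from the log."""
    expected = log.expected_rows()
    return sorted(k for k in expected if db.get(k) != expected[k])
```

In a real dtest the dict writes would be replaced by queries against the cluster, and verify() would read rows back (all of them, or a sample) and compare them to the log.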

The other notion in this prototype was to make the logging pluggable, so if we're testing a smaller dataset we could plug in an in-memory log instead of an on-disk one: https://github.com/riptano/cassandra-dtest/blob/experimental_datatool/datahelp.py#L158
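
As a rough illustration of what "pluggable" could mean here (this is an assumed shape, not what datahelp.py actually does), two log backends can share the same append()/entries() interface so the data creation code does not care which one it was given. InMemoryLog and DiskLog are hypothetical names, and the JSON-lines file format is just one simple choice.

```python
import json


class InMemoryLog:
    """Hypothetical backend for small datasets: keeps entries in a list."""

    def __init__(self):
        self._entries = []

    def append(self, pkey, row):
        self._entries.append((pkey, dict(row)))

    def entries(self):
        return iter(list(self._entries))


class DiskLog:
    """Hypothetical backend for larger datasets: one JSON object per line."""

    def __init__(self, path):
        self._path = path
        open(path, "a").close()  # make sure the file exists

    def append(self, pkey, row):
        with open(self._path, "a") as f:
            f.write(json.dumps({"pkey": pkey, "row": row}) + "\n")

    def entries(self):
        with open(self._path) as f:
            for line in f:
                rec = json.loads(line)
                yield rec["pkey"], rec["row"]


def log_rows(log, rows):
    """Data creation code only needs append(); any backend will do."""
    for pkey, row in rows:
        log.append(pkey, row)
```

Either backend can then be handed to the same test code, for example log_rows(InMemoryLog(), rows) for a small run and log_rows(DiskLog(path), rows) for a large one.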

This is far from complete, but I wanted to show a kernel of the idea.

To make it really great we'd need novel (random) schema generation, and the code would need to know what operations are available on a generated schema for a particular C* version (complicated perhaps, but fun).
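
As a very small sketch of the schema generation half only (nothing like this exists in the prototype yet), a seeded generator could emit a CREATE TABLE statement with a random set of column types. The function name random_schema, the fixed int primary key and the short type list are all assumptions for illustration; knowing which operations are valid per C* version is not covered.

```python
import random

# A small subset of CQL native types, chosen for illustration only.
CQL_TYPES = ["int", "bigint", "text", "boolean", "double", "uuid", "timestamp"]


def random_schema(seed, table="t", max_columns=5):
    """Return (ddl, columns) for a randomly generated table.

    Hypothetical helper: the same seed always yields the same schema, so a
    failing run can be reproduced.
    """
    rng = random.Random(seed)
    n = rng.randint(1, max_columns)
    columns = [("c%d" % i, rng.choice(CQL_TYPES)) for i in range(n)]
    col_defs = ", ".join("%s %s" % (name, typ) for name, typ in columns)
    ddl = "CREATE TABLE %s (pk int PRIMARY KEY, %s)" % (table, col_defs)
    return ddl, columns
```

Seeding the generator matters more than the details of the schema: a test that fails on a random schema is only useful if the schema can be regenerated.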

Another direction we could take is trying to figure out a way to do db schema/operations with semi-predictable data patterns, so that we could capture the on-disk log as something sparser that understands ranges (so if we have pkey 1..1000 and key2 1..1000 there's maybe no real need to capture those million cells to a log in long form -- we could abbreviate that somehow).
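
One possible way to abbreviate, sketched under the assumption that the data really is a full grid of pkey x key2 values: store each grid as a pair of ranges and expand it only when needed. RangeLog and its methods are hypothetical names, and the sketch assumes recorded grids do not overlap.

```python
from itertools import product


class RangeLog:
    """Hypothetical sparse log: stores (pkey range, key2 range) pairs
    instead of one entry per cell. Assumes recorded grids do not overlap."""

    def __init__(self):
        self._grids = []

    def record_grid(self, pkeys, key2s):
        # pkeys and key2s are range objects, e.g. range(1, 1001) for 1..1000
        self._grids.append((pkeys, key2s))

    def cell_count(self):
        """Number of cells described, without materializing them."""
        return sum(len(p) * len(k) for p, k in self._grids)

    def contains(self, pkey, key2):
        """True if the (pkey, key2) cell is expected to exist."""
        return any(pkey in p and key2 in k for p, k in self._grids)

    def expand(self):
        """Lazily yield every (pkey, key2) pair, for full verification."""
        for p, k in self._grids:
            for pair in product(p, k):
                yield pair
```

With pkey 1..1000 and key2 1..1000 this keeps a single log entry for a million cells, and a sampling check can call contains() without ever expanding the grid.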

> Data validation test
> --------------------
>
>                 Key: CASSANDRA-8654
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8654
>             Project: Cassandra
>          Issue Type: Test
>            Reporter: Russ Hatch
>            Assignee: Russ Hatch
>
> There was a recent discussion about the utility of data validation testing.
> The goal here would be a harness of some kind that can mix operations and track its own notion of what the DB state should look like, and verify it in detail, or perhaps a sampling.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)