Date: Mon, 6 Jun 2016 17:54:21 +0000 (UTC)
From: "Brian Hess (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-7622) Implement virtual tables

    [ https://issues.apache.org/jira/browse/CASSANDRA-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316875#comment-15316875 ]

Brian Hess commented on CASSANDRA-7622:
----------------------------------------

There are a few different things that could be accomplished with Virtual Tables, and I'm curious which one this ticket will focus on. I'd suggest breaking the full set below into separate tickets, since I think some deserve more discussion/debate than others.

1. CQL access to internal metrics and management. This would be read access to metrics (such as JMX metrics) and management settings (such as cassandra.yaml settings). These could be covered by read-only access to built-in virtual tables (update/write access is item 2 below).
2. CQL access for updating management settings. This is different from management actions (such as `nodetool flush` or repair).
3. CQL access to management actions. This is the category of things such as `nodetool flush`, repair, drain, etc.
4. CQL access to arbitrary classes that implement user-defined behavior.

I think that 1 is the least controversial, and clearly of high value. My suggestion is that these be built-in Virtual Tables, perhaps in system._vt_*. For example, system._vt_metrics as SELECT access to the JMX metrics, or system._vt_settings as SELECT access to yaml/command-line/etc. settings. These would be predefined virtual tables. I am concerned about how we specify which node's metrics we are querying, and how we route the query to that node (more on this below - it applies to all 4 scenarios, and to any more we come up with). A sketch of what such a table might look like follows.
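For concreteness, here is a purely illustrative sketch of what the read-only case (1) might look like. The table name, the columns, and the use of the node address as the partition key are placeholders for discussion, not a proposed schema, and such a table would of course be backed by the metrics registry rather than created with user DDL:

    -- Illustrative only: a hypothetical per-node, read-only view over JMX
    -- metrics. The system keyspace is reserved, so this DDL only shows the
    -- shape of the table; it is not something a user would actually run.
    CREATE TABLE system._vt_metrics (
        node   inet,     -- the node whose metrics are being read
        scope  text,     -- e.g. 'ClientRequest', 'ThreadPools', 'Table'
        name   text,     -- e.g. 'Read.Latency.99thPercentile'
        value  double,
        PRIMARY KEY ((node), scope, name)
    );

    -- Read one node's client-request metrics:
    SELECT name, value
      FROM system._vt_metrics
     WHERE node = '1.2.3.4' AND scope = 'ClientRequest';

A system._vt_settings table could presumably take the same per-node shape.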
I also think that 2 is not very controversial. This would allow updating settings, for example via a CQL UPDATE command (e.g., `UPDATE system._vt_settings SET max_hint_window_in_ms = 10800000 WHERE node='1.2.3.4'` - for illustrative purposes). The reason this is different from 3 is that after doing the update, a read will return the value that was just written, since the setting really was changed. Would we want to support `UPDATE system._vt_settings SET max_hint_window_in_ms = 10800000` to update all nodes, even though that is not valid CQL since we don't specify a partition key?

I'm not sure about 3, and I think it should be discussed further. The problem is that with something like `ALTER system_virtual.actions SET drain=true` (or really `UPDATE system_virtual.actions SET drain=true`), what would the value be if we SELECT it after the ALTER/UPDATE? It feels like it would be set back to FALSE once the drain finishes, which semantically feels strange. I like [~snazy]'s suggestion of completely separate CQL syntax for these actions - maybe something like `EXECUTE ACTION drain ON '1.2.3.4'` (again, syntax just for illustrative purposes). The semantics here are that the system is being altered, but outside a table, so a SELECT doesn't have the confusion described above. As [~snazy] pointed out, we would also need to be careful about access to this capability (perhaps user/role permissions are sufficient, and we just need to add new permission types).

4 is probably the one that gives the most pause, IMO, as there would be arbitrary user code running in the read and write path. We need to take care with that - as we have with UDFs. It isn't exactly clear to me what the goal is for arbitrary virtual tables. What are we trying to accomplish with this extension point?

As I mentioned above, I'm curious how we will route the query (SELECT, UPDATE, EXECUTE, whatever) to the right node. In the simple case of a SELECT of JMX metrics, how would we query the JMX metrics of node 1.2.3.4? These queries/operations are per-node, not cluster-wide, so we need to think about that. Is the node the partition key of the virtual table (`.... PARTITION BY ((node), ...)`)? How would we route that internally in CQL (and with awareness on the driver)? It sounds like it would require a special partitioner (like the system tables have - or maybe ByteOrderedPartitioner could work with a custom replication strategy, but with Murmur3Partitioner things are hard, since it's essentially hash inversion) and possibly a replication strategy (I don't think a replication strategy alone would cover this, right [~jjordan]?). If these are fully user-defined (as [~jjirsa] suggests), then we need a way to specify partitioners at the table (or at least keyspace) level, as opposed to cluster-wide (which would actually be a nice change, IMO). If they are built-in, then we can cover this case like we do with system keyspace tables.

Another approach may be to have the drivers deal with the routing. Virtual Tables could be special cases for routing purposes and have an API call to specify which node to query - or the LoadBalancingPolicy could inspect the Statement and route appropriately, which could work for any/all partitioners and/or replication strategies. If it is a LoadBalancingPolicy, then what happens if we use the "wrong" one (not the VirtualTableLoadBalancingPolicy, say) for a Virtual Table query? I guess a new Error type.
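To make the routing concern concrete, here are two reads against the illustrative system._vt_metrics sketch above (again, all names are placeholders only): one restricts the node partition key and could in principle be targeted at a single node, the other has no node restriction and runs into exactly the per-node-vs-cluster-wide problem described above:

    -- The node is named in the partition key, so a coordinator (or a
    -- driver-side policy) could in principle send this straight to 1.2.3.4:
    SELECT name, value
      FROM system._vt_metrics
     WHERE node = '1.2.3.4' AND scope = 'ThreadPools';

    -- No node restriction: with Murmur3Partitioner there is no way to go
    -- from a token back to "which node", so there is no single target -
    -- this is the cluster-wide / "all nodes" case discussed above.
    SELECT node, name, value FROM system._vt_metrics;

Whether that targeting happens via a special partitioner, a per-table/keyspace partitioner, or driver-side routing is the open question.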
> Implement virtual tables
> ------------------------
>
>                 Key: CASSANDRA-7622
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7622
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tupshin Harper
>            Assignee: Jeff Jirsa
>             Fix For: 3.x
>
>
> There are a variety of reasons to want virtual tables, which would be any table that would be backed by an API, rather than data explicitly managed and stored as sstables.
> One possible use case would be to expose JMX data through CQL as a resurrection of CASSANDRA-3527.
> Another is a more general framework to implement the ability to expose yaml configuration information. So it would be an alternate approach to CASSANDRA-7370.
> A possible implementation would be in terms of CASSANDRA-7443, but I am not presupposing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)