cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sam Tunnicliffe (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9459) SecondaryIndex API redesign
Date Fri, 21 Aug 2015 13:44:47 GMT


Sam Tunnicliffe commented on CASSANDRA-9459:

Thanks [~slebresne], I appreciate it's a bit of a slog. I've pushed fixes that address most
of your first set of points (too many to enumerate so I'll only mention the ones which I didn't
just fix, but there's basically a new commit per-point in the branch if you want to verify
any in particular).
bq. {{getBestIndexFor}} seems to now favor indexes that handle more of the expressions. I'm
not convinced by that heuristic

The metric isn't so much how many of the original expressions can an Index handle, but the
amount of filtering left to do on the results of the index lookup. It just so happens that
all our current (built in) index implementations only reduce any filter to the degree of removing
1 expression (at most). In it's current incarnation {{RowFilter}} is limited to describing
only a very simple expression tree, one with depth of 1 and containing only AND relations.
As an effort to future-proof the Index interface somewhat I thought transforming the original
filter into one representing what the index *can't* do would be a decent heuristic. The naive
comparison in {{SIM}} which evaluates the reduced filters by number of expressions could safely
be changed later without breaking custom index implementations (at least, that was the intention).
The merits of attempting to future proof vs YAGNI are of course debatable, so if we agree
this needs more consideration then I'll happily revert the selection criteria to simply consider

bq. The initialization of an {{Index}} bothers me a bit

bq. As for registration, index can do it directly in the ctor by calling {{baseCfs.indexManager.register()}}
(we can also specify they have to do it). 

I'm not a fan of enforcing that kind of thing just by specification. Also, IoC is more explicit,
cleaner and reduces dependencies. I *would* also argue that it improves testability but as
I'm not actually testing this anywhere, I won't. Either way, I'm fine with requiring a constructor
with a specific signature & removing the {{init}} method, but I'd prefer to keep {{register}}
separate if you don't object too strongly.
Aside from that, I agree with all your points regarding construction/init/reload etc. On caveat
is that {{SIM#reload}} can't just reload the indexes it knows about already, it has to check
that every index defined in its base CFS's metadata is present. In the schema update from
executing {{CreateIndexStatement}}, only the {{CFMetaData.indexes}} is updated, not the {{ColumnFamilyStore}}
itself, which is just reloaded. When that reload happens then, the {{SIM}} needs to add the
new index which is present in {{CFM.indexes}}, but not already registered. With that exception,
I've made those changes in {{454bba7708018f872df825103aa52a99c8f653bd}}.

bq. Not a fan of using {{Optional}} as return type of {{Index.getReducedFilter}} as I would
have expected intuitively that an empty optional would means the whole filter is reduced

As noted above, this whole approach to selection is up for debate so it may be that this simply
becomes not a problem. That said, I would argue against a couple of your points here; I don't
agree that an empty optional intuitively implies the total reduction of the filter, in that
case an empty filter, not an empty optional, would seem most semantically correct to me. Secondly,
it feels fragile to make assumptions about object equality in this way, especially in an extension
point like this. I would rather not depend on documentation to enforce this sort of thing.
Prior art for doing this *kind* of thing in C* is to return null when there's caller cannot
satisfy a request and so using {{Optional}} instead seems pretty reasonable to me.

bq. I also don't find the {{Index.getReducedFilter}} naming too intuitive. I'd have prefer
something like {{Index.getUnhandledExpressions}}

Again, the point of the method isn't necessarily to identify which of *original* expressions
are unhandled, isn't it conceivable that a custom index could radically transform a filter
into one containing an entirely disjoint set of expressions from the original?

bq. In {{CassandraIndex.indexFor}}, the implementations of {{insertRow}} and {{removeRow}}
seems dangerous to me..if insertRow() is called with some tombstone, it will insert the cells
instead of removing them for instance{{insertRow}} and {{updateRow}}.

Ok that's a good point, but can you clarify do you mean regular the row may contain regular
tombstones (i.e. the result of "DELETE col FROM table WHERE ..."?) If that is the case, then
yes there is a bug currently in that we will try to insert an index entry for the (empty)
value of the tombstone. I've fixed that (in {{4d7fa1b83a92f4f5067af5ac1693ebef48135ea0}},
but I don't quite get what you mean by "it will insert the cells instead of removing them
for instance", as the tombstone has no value there's nothing we can remove - in {{updateRow}}
we can remove the value from the existing row ofc. I also think we {{removeRow}} is still
needed for cleanup & compaction

bq. Why do we care about {{IndexMetadata.TargetType}}?

The main driver for {{IndexMetadata.TargetType}} was to make explicit the difference between
indexes with dynamic and static sets of target columns. We could infer that an index greedily
processes all partition updates if its target columns is an empty set, but we discussed this
in CASSANDRA-6717 and decided that explicitly indicating that was better (at least from a
{{system_schema}} perspective).

bq. The fact that we have both {{IndexMetadata.TargetType}} and {{IndexTarget.TargetType}}
is terribly confusing

It's pretty lame, but the best I could come up with was to rename {{IndexTarget.TargetType}}
to {{IndexTarget.Type}}, it's only used within {{IndexTarget}} itself an {{CreateIndexStatement}}
so maybe the extra info in the name is redundant.

bq. In {{}}, I'd rather make the {{if (index.columns.size() == 1)}}
an assertion: we _will_ have to change this code once we support index on multiple columns
(either by erroring out, or by actually dropping the index anyway), so I'd prefer having the
code crash loudly if we forgot to update it rather than silently doing the wrong thing.

Good call, I almost did this in the first place, fixed.

bq. In {{Keyspace.getValidColumnFamilies}}, it would make more sense to me to have a {{SecondaryIndexManager.getIndexByName(String
indexName)}} method and use that (as a side note, it's weird that the method doesn't actually
use the 'idxName' variable. Do you know what gives?).

Yes, that's something of a legacy from earlier versions before CASSANDRA-6717 added {{IndexMetadata}}
(at that point I'd completely removed the {{SIM.indexes}} map and missed here when adding
it back. I'm not sure what you mean about 'idxName', looks like it's used to me (though reading
this method was making my head hurt and there was a bug in it, so I've pushed a cleaned up

bq. Is there a reason for {{IndexRegistry}}? It doesn't hurt, but it doesn't seem of any concrete
use to me (maybe it's for custom indexes, but even then I'm not really sure what it allows

Mentioned above re:IoC, if we agree that this is a reasonable approach, abstracting the registry-ness
from {{SIM}} makes it much easier to use a lightweight implementation for tests etc (argument
slightly undermined by not actually doing this anywhere).

bq. Your IDE seems to expand {{.*}} imports.

arrrgh, my settings had been lost and out of force of habit I'd been "fixing" the imports
(or so I thought) every time I made a change. I'll make sure I do a final pass through all
the modified files before this gets committed.

I'm nearly done with the points from your second comment, so I'll update again shortly

> SecondaryIndex API redesign
> ---------------------------
>                 Key: CASSANDRA-9459
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sam Tunnicliffe
>            Assignee: Sam Tunnicliffe
>             Fix For: 3.0 beta 1
> For some time now the index subsystem has been a pain point and in large part this is
due to the way that the APIs and principal classes have grown organically over the years.
It would be a good idea to conduct a wholesale review of the area and see if we can come up
with something a bit more coherent.
> A few starting points:
> * There's a lot in AbstractPerColumnSecondaryIndex & its subclasses which could be
pulled up into SecondaryIndexSearcher (note that to an extent, this is done in CASSANDRA-8099).
> * SecondayIndexManager is overly complex and several of its functions should be simplified/re-examined.
The handling of which columns are indexed and index selection on both the read and write paths
are somewhat dense and unintuitive.
> * The SecondaryIndex class hierarchy is rather convoluted and could use some serious
> There are a number of outstanding tickets which we should be able to roll into this higher
level one as subtasks (but I'll defer doing that until getting into the details of the redesign):
> * CASSANDRA-7771
> * CASSANDRA-8103
> * CASSANDRA-9041
> * CASSANDRA-4458
> * CASSANDRA-8505
> Whilst they're not hard dependencies, I propose that this be done on top of both CASSANDRA-8099
and CASSANDRA-6717. The former largely because the storage engine changes may facilitate a
friendlier index API, but also because of the changes to SIS mentioned above. As for 6717,
the changes to schema tables there will help facilitate CASSANDRA-7771.

This message was sent by Atlassian JIRA

View raw message