cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
Date Tue, 19 Sep 2017 11:23:02 GMT


Paulo Motta commented on CASSANDRA-12245:

Finally getting to this after a while, sorry for the delay. Thanks for the update! Had another
look at the patch and this is looking much better, see some follow-up comments below:

bq. I have moved the methods to split the ranges to the Splitter, reusing its valueForToken
method. Tests here.

Awesome, looks much better now! It seems like the way the number of tokens in a range was
computed by {{abs(range.right - range.left)}} may not work correctly for some wrap-around
cases, as shown by [this test case|].
Even though this shouldn't break when local ranges are used, I fixed it on [this commit|]
to make sure split works correctly for wrap-around ranges. Can you confirm this is correct?
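
To illustrate the wrap-around concern, here is a minimal Python sketch (not the Cassandra Java code). It assumes Murmur3-style signed 64-bit tokens and treats a range as {{(left, right]}} on the ring, with {{left == right}} meaning the full ring. The function names are hypothetical.

```python
# Illustrative only: assumes tokens are signed 64-bit integers on a ring.
RING_SIZE = 2**64  # number of distinct token values


def naive_width(left, right):
    """Width computed as abs(right - left); wrong when the range wraps."""
    return abs(right - left)


def range_width(left, right):
    """Number of tokens in (left, right] on the ring."""
    if right > left:
        return right - left
    # Wrap-around case (left == right is treated here as the full ring).
    return RING_SIZE - (left - right)


left, right = 100, -100  # wraps: goes up to the max token, then on to -100
print(naive_width(left, right))   # 200
print(range_width(left, right))   # 2**64 - 200
```

For the wrapping range above the naive formula reports 200 tokens, while the range actually covers almost the whole ring.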

Other than that, it seems like you added unit tests only for {{Murmur3Partitioner}}, would
you mind extending {{testSplit()}} to {{RandomPartitioner}}?

bq. Agree. I have added a new dedicated executor in the CompactionManager, similar to the
executors used for validation and cache cleanup. The concurrency of this executor is determined
by the new config property concurrent_materialized_view_builders, which defaults to a perhaps
too much conservative value of 1. This property can be modified through both JMX and the new
setconcurrentviewbuilders and getconcurrentviewbuilders nodetool commands. These commands
are tested here.

I think having a dedicated executor will ensure view building doesn't compete with compactions
for the compaction executor, good job! One problem I see though is that users who find
view building slow will try to increase the number of concurrent view builders via
nodetool, but it will have no effect since the range was already split according to the previous number of
concurrent view builders. Given this will be a pretty common scenario for large datasets,
how about splitting the range into multiple smaller tasks, so that if the user increases {{concurrent_materialized_view_builders}}
the other tasks immediately start executing?

We could use a simple approach of splitting the local range into, let's say, 1000 hard-coded parts,
or be smarter and make each split have ~100MB or so. In this way we can keep {{concurrent_materialized_view_builders=1}}
by default, and users with large base tables are able to increase it and see immediate effect
via nodetool. WDYT?
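
A rough Python sketch of the "fixed number of parts" idea, for discussion only. It assumes a non-wrapping integer range {{(left, right]}}; {{split_range}} is a hypothetical helper, not the Splitter API.

```python
def split_range(left, right, parts):
    """Split the non-wrapping range (left, right] into contiguous sub-ranges.

    Returns at most `parts` sub-ranges; fewer if the range holds fewer tokens.
    """
    width = right - left
    if width <= 0 or parts <= 0:
        raise ValueError("expected left < right and parts > 0")
    parts = min(parts, width)  # never produce empty sub-ranges
    bounds = [left + (width * i) // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# Many small tasks: pending ones can be picked up as soon as more
# builder threads become available.
tasks = split_range(0, 1_000_000, 1000)
print(len(tasks), tasks[0], tasks[-1])
```

Because the number of tasks no longer depends on the concurrency setting at submission time, raising the setting later still finds queued work to run.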

bq. I would prefer to do this in another ticket.


bq. I have moved the marking of system tables (and the retries in case of failure) from the
ViewBuilderTask to the ViewBuilder, using a callback to do the marking. I think the code is
clearer this way.

Great, looks much cleaner indeed! One minor thing is that if there's a failure after some
{{ViewBuilderTask}} instances have completed, each of them will be resumed from its last token even though it
had already finished. Could we maybe set last_token = end_token when a task finishes,
to flag it as done and avoid resuming it in that case?
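
A small Python sketch of what I mean, under the assumption that each task persists its start token, end token and last built token. The {{TaskProgress}} type and {{ranges_to_resume}} helper are hypothetical and only illustrate the flagging idea.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaskProgress:
    start_token: int
    end_token: int
    last_token: Optional[int] = None  # None means the task never made progress

    def mark_finished(self) -> None:
        # Flag completion by moving last_token to the end of the range.
        self.last_token = self.end_token

    @property
    def finished(self) -> bool:
        return self.last_token == self.end_token


def ranges_to_resume(tasks: List[TaskProgress]) -> List[Tuple[int, int]]:
    """Return the (resume_from, end_token) ranges that still need building."""
    pending = []
    for task in tasks:
        if task.finished:
            continue  # already complete, nothing to resume
        resume_from = task.start_token if task.last_token is None else task.last_token
        pending.append((resume_from, task.end_token))
    return pending
```

With this, a restart after a partial failure only picks up the unfinished sub-ranges.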

bq. Updated here. It also uses a byteman script to make sure that the MV build isn't finished
before stopping the cluster, which is more likely to happen.

The dtest looks mostly good, except for the following nits:
* {{concurrent_materialized_view_builders=1}} when the nodes are restarted. Can you set the
configuration value during cluster setup phase (instead of setting via nodetool) to make sure
the restarted view builds will be parallel?
* can probably use {{self._wait_for_view("ks", "t_by_v")}} [here|]
* We cannot ensure key 10000 was not built [here|]
which may cause flakiness, so it's probably better to check for {{self.assertNotEqual(list(session.execute("SELECT
count(*) FROM t_by_v;"))[0][0], 10000)}} or something like that.
* It would be nice to check that the view build was actually resumed on restart, by checking
for the log entry {{Resuming view build for range}}

bq. I'm not sure about if it still makes sense for the builder task to extend CompactionInfo.Holder.
If so, I'm neither sure about how to use prevToken.size(range.right) (that returns a double)
to create CompactionInfo objects. WDYT?

I think it's still useful to show view build progress via {{nodetool compactionstats}}, so
I created a new method {{Splitter.positionInRange(Token token, Range<Token> range)}}
which gives the position of a token relative to a range and used that to show view build progress
when a splitter is present. When it's not (such as in the case of {{ByteOrderedPartitioner}}),
we fall back to the progress based on the keys estimate. This is implemented on [this commit|].
Please let me know what you think about this approach.
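
To make the idea concrete, here is a Python sketch of computing the relative position of a token within a range. It is not the code from the commit; it assumes signed 64-bit tokens on a ring and {{(left, right]}} ranges, and {{position_in_range}} is a hypothetical name.

```python
RING_SIZE = 2**64  # assumes signed 64-bit tokens on a ring


def position_in_range(token, left, right):
    """Fraction of (left, right] covered once `token` has been reached.

    Returns a float in [0.0, 1.0]; works for wrap-around ranges too.
    """
    # Modular arithmetic handles wrap-around; left == right means full ring.
    width = (right - left) % RING_SIZE or RING_SIZE
    offset = (token - left) % RING_SIZE
    if offset > width:
        raise ValueError("token is outside the range")
    return offset / width


print(position_in_range(50, 0, 100))  # 0.5
```

The returned fraction could then be scaled to whatever unit the progress report expects.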

In addition to the suggestion above, I also made the following improvements:
* Select only sstables that fall into the {{ViewBuilderTask}} range ([commit|])
* Simplify ViewBuilderTask loop ([commit|])
* It's pretty rare but in the case of collisions it's possible for multiple keys to share
the same token, so I updated the {{ViewBuilderTask}} loop to build all the keys sharing the
same token ([commit|])
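
For the token-collision point, a minimal Python sketch of the loop shape (not the actual {{ViewBuilderTask}} code). It assumes rows arrive as (token, key) pairs sorted by token; {{build_key}} and {{save_progress}} are hypothetical callbacks.

```python
from itertools import groupby
from operator import itemgetter


def build_range(rows, build_key, save_progress):
    """Build every key, recording progress only after a whole token is done.

    rows: iterable of (token, key) pairs sorted by token.
    """
    for token, group in groupby(rows, key=itemgetter(0)):
        for _, key in group:
            build_key(key)  # all keys sharing this token are built together
        save_progress(token)  # safe resume point: nothing left for this token


built, progress = [], []
rows = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
build_range(rows, built.append, progress.append)
print(built, progress)
```

Recording the last token only after all of its keys are built avoids skipping a colliding key when the build is resumed from that token.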

I created a [PR|] on your branch with the above changes.

Even though the patch is looking good and has some dtest coverage, I feel that we are still
missing some unit testing to have confidence this is working as desired and catch any subtle
regression, given this is critical for correct MV functioning. With that said, it would be
nice if we could test that {{ViewBuilderTask}} is correctly building a specific range and
maybe extend {{ViewTest.testViewBuilderResume}} to test view building/resume with different
numbers of concurrent view builders. What do you think?

> initial view build can be parallel
> ----------------------------------
>                 Key: CASSANDRA-12245
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Materialized Views
>            Reporter: Tom van der Woerdt
>            Assignee: Andrés de la Peña
>             Fix For: 4.x
> On a node with lots of data (~3TB) building a materialized view takes several weeks,
which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized:
>  * do vnodes in parallel, instead of going through the entire range in one thread
>  * just iterate through sstables, not worrying about duplicates, and include the timestamp
of the original write in the MV mutation. since this doesn't exclude duplicates it does increase
the amount of work and could temporarily surface ghost rows (yikes) but I guess that's why
they call it eventual consistency. doing it this way can avoid holding references to all tables
on disk, allows parallelization, and removes the need to check other sstables for existing
data. this is essentially the 'do a full repair' path
