cassandra-commits mailing list archives

From Andrés de la Peña (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-12245) initial view build can be parallel
Date Thu, 12 Oct 2017 08:11:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200126#comment-16200126
] 

Andrés de la Peña edited comment on CASSANDRA-12245 at 10/12/17 8:10 AM:
-------------------------------------------------------------------------

Sorry for the delay, and thanks for the thorough review. I've updated [the patch|https://github.com/apache/cassandra/compare/trunk...adelapena:12245-trunk]
to address the comments.

bq. It seems like the way the number of tokens in a range was computed by abs(range.right
- range.left) may not work correctly for some wrap-around cases, as shown by [this test case|https://github.com/pauloricardomg/cassandra/blob/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237/test/unit/org/apache/cassandra/dht/SplitterTest.java#L184].
Even though this shouldn't break when local ranges are used, I fixed it on [this commit|https://github.com/pauloricardomg/cassandra/commit/2760bbbc25a2ad9a9bbf9d29a0dc19e1e3bfb237]
to make sure split works correctly for wrap-around ranges. Can you confirm this is correct?

I think this is right, merged.
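For reference, the wrap-around pitfall can be sketched in a few lines of Python. This is a simplified, hypothetical model of a signed 64-bit Murmur3-style token ring (not Cassandra's actual {{Splitter}} code): a plain {{abs(range.right - range.left)}} undercounts a wrap-around range, where {{right < left}} and the range covers both "ends" of the ring.

```python
# Hypothetical model of token counting on a signed 64-bit token ring.
# For a wrapping (left, right] range, abs(right - left) is wrong: the
# range covers everything EXCEPT the (right, left] gap.

RING_SIZE = 2**64  # number of distinct 64-bit tokens on the ring

def tokens_in_range(left, right):
    """Number of tokens in the (left, right] range, wrap-around aware."""
    if right > left:
        return right - left          # ordinary, non-wrapping range
    # wrap-around: the whole ring minus the uncovered gap
    return RING_SIZE - (left - right)

# Non-wrapping range: same answer as abs().
assert tokens_in_range(10, 100) == 90
# Wrap-around range: abs(10 - 100) == 90 would be wrong here.
assert tokens_in_range(100, 10) == RING_SIZE - 90
```

A wrap from the maximum token past the minimum behaves as expected under this model: {{tokens_in_range(2**63 - 1, -2**63 + 5)}} counts just the 6 tokens at the seam.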

bq. Other than that, it seems like you added unit tests only for Murmur3Partitioner, would
you mind extending testSplit() to RandomPartitioner?

Done [here|https://github.com/apache/cassandra/commit/6f92541abaa61ed789f50f67c8800af3f97162a5].

bq. I think having a dedicated executor will ensure view building doesn't compete with compactions
for the compaction executor, good job! One problem I see though is that if users find
their view building slow, they will try to increase the number of concurrent view builders via
nodetool, but it will have no effect, since the range was split into the previous number of
concurrent view builders. Given this will be a pretty common scenario for large datasets,
how about splitting the range in multiple smaller tasks, so that if the user increases {{concurrent_view_builders}}
the other tasks immediately start executing?
bq. We could use a simple approach of splitting the local range into, let's say, 1000 hard-coded
parts, or be smarter and make each split have ~100MB or so. In this way we can keep {{concurrent_materialized_view_builders=1}}
by default, and users with large base tables are able to increase it and see immediate effect
via nodetool. WDYT?

This makes a lot of sense. I'm worried about creating thousands of tasks for large datasets
if the number of tasks is proportional to the amount of data. Instead, I think we could fix the
number of partitions to the highest reasonable number of parallel tasks, something like [a
multiple of the number of available processors|https://github.com/adelapena/cassandra/blob/e460b3b76935cad60a9dc1e00c8d3af8bfa9584a/src/java/org/apache/cassandra/db/view/ViewBuilder.java#L59].
This would provide the desired immediate performance improvement when the user increases the
number of concurrent view builders, while keeping the number of tasks bounded independently
of the amount of data. What do you think? Does that make sense?
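The idea can be sketched as follows. This is a rough, hypothetical Python illustration, not the patch's actual code: the names and the 4x multiplier are assumptions; the point is that the task count derives from the processor count, not from the data size or the current concurrency setting.

```python
# Hypothetical sketch: derive a fixed number of split tasks from the CPU
# count, so raising concurrent_materialized_view_builders via nodetool has
# an immediate effect, while the task count stays bounded for any dataset.
import os

def num_split_points(processors=None):
    procs = processors or os.cpu_count() or 1
    return procs * 4  # assumed multiplier, for illustration only

def split_evenly(left, right, parts):
    """Split a non-wrapping (left, right] token range into `parts` subranges."""
    step = (right - left) // parts
    bounds = [left + i * step for i in range(parts)] + [right]
    return list(zip(bounds[:-1], bounds[1:]))

subranges = split_evenly(0, 1000, 4)
assert len(subranges) == 4
assert subranges[0] == (0, 250) and subranges[-1] == (750, 1000)
```

With, say, 8 processors this yields 32 tasks regardless of whether the base table holds 10 GB or 3 TB, so a user can bump concurrency up to that limit and see idle tasks picked up immediately.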

bq. Great, looks much cleaner indeed! One minor thing is that if there's a failure after some
{{ViewBuildTasks}} were completed, it will resume that subtask from its last token while it
already finished. Could we maybe set the last_token = end_token when the task is finished
to flag it was already finished and avoid resuming the task when that is the case?

Done [here|https://github.com/adelapena/cassandra/blob/d505d8014524b72422ba1eee036494bda39f53f0/src/java/org/apache/cassandra/db/view/ViewBuilderTask.java#L166-L168].

One case that we hadn't considered is that if the token ranges change or are split in a different
way when resuming a build, then the progress would be lost, because {{ViewBuildTask}} won't
find any entry for the new range in {{system.view_builds_in_progress}}. This would be especially
true if we split the ranges by their data size. So, independently of how we finally split
the ranges, I think it makes sense to load all the ranges with any progress from {{system.view_builds_in_progress}}
in {{ViewBuilder}} before splitting the local ranges, create a task for those of them that
are not already finished, and then split any remaining uncovered local range. This also has the
advantage of skipping the creation of tasks for already completed ranges. What do you think?
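The proposed resume flow can be modeled in plain Python. This is a simplified, hypothetical sketch (names like {{plan_build_tasks}} are invented, and integer ranges stand in for token ranges): saved progress rows are loaded first, ranges flagged finished ({{last_token == end}}) are skipped, unfinished ones become resume tasks, and only the still-uncovered part of the local range is split into new tasks.

```python
# Simplified model of the resume logic discussed above (hypothetical names,
# not the actual ViewBuilder code).

def subtract(rng, covered):
    """Return the parts of the (start, end] range `rng` not covered by the
    disjoint subranges in `covered`."""
    gaps, cursor = [], rng[0]
    for start, end in sorted(covered):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < rng[1]:
        gaps.append((cursor, rng[1]))
    return gaps

def plan_build_tasks(local_range, saved_progress, split):
    """saved_progress: (start, end, last_token) rows; split: function that
    splits a range into subranges. Returns (start, end, resume_from) tasks."""
    resume, covered = [], []
    for start, end, last_token in saved_progress:
        covered.append((start, end))
        if last_token == end:          # flagged finished: no task needed
            continue
        resume.append((start, end, last_token))
    fresh = [(s, e, None)
             for gap in subtract(local_range, covered)
             for s, e in split(gap)]
    return resume + fresh

tasks = plan_build_tasks((0, 300),
                         [(0, 100, 100),     # finished (last_token == end)
                          (100, 200, 150)],  # partially built
                         split=lambda r: [r])
assert tasks == [(100, 200, 150), (200, 300, None)]
```

Note how the finished (0, 100] range produces no task at all, while the new split logic only ever sees the uncovered (200, 300] remainder, so changing the split scheme between restarts cannot discard saved progress.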

I have also removed the method {{SystemKeyspace.beginViewBuild}} because I don't see any need
to save a range without progress. Indeed, if the view build is restarted, it is probably
better not to restore tasks without progress and to let their tokens be processed by
the split logic.

bq. The dtest looks mostly good, except for the following nits:
bq.  * {{concurrent_materialized_view_builders=1}} when the nodes are restarted. Can you set
the configuration value during cluster setup phase (instead of setting via nodetool) to make
sure the restarted view builds will be parallel?

Sure, done [here|https://github.com/adelapena/cassandra-dtest/blob/12245/materialized_views_test.py#L949-L951].

bq.  * can probably use {{self._wait_for_view("ks", "t_by_v")}} [here|https://github.com/adelapena/cassandra-dtest/blob/cf893982c361fd7b6018b2570d3a5a33badd5424/materialized_views_test.py#L982]

The first wait does the opposite of {{_wait_for_view}}: it waits until there is some progress.
But we can use {{self._wait_for_view("ks", "t_by_v")}} [here|https://github.com/adelapena/cassandra-dtest/blob/12245/materialized_views_test.py#L1005].
Did you mean that?
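The distinction between the two waits can be sketched with a generic polling helper. This is a hypothetical illustration, not the dtest's actual code; {{get_progress}} and {{build_finished}} stand in for whatever queries the test runs.

```python
# Generic poll-until helper, illustrating the two different waits: one for
# "some progress exists" and one for "the build has finished".
import time

def wait_until(predicate, timeout=30.0, interval=0.1):
    """Poll `predicate` until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Wait for *some* progress (opposite of _wait_for_view):
#   wait_until(lambda: get_progress() > 0)
# Wait for the build to *finish* (what _wait_for_view does):
#   wait_until(lambda: build_finished())
```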

bq.  * We cannot ensure key 10000 was not built here which may cause flakiness, so it's probably
better to check for {{self.assertNotEqual(len(list(session.execute("SELECT count(*) FROM t_by_v;"))),
10000)}} or something like that.

Good catch, done [here|https://github.com/adelapena/cassandra-dtest/blob/12245/materialized_views_test.py#L1002].

bq.  * It would be nice to check that the view build was actually resumed on restart, by checking
for the log entry {{Resuming view build for range}}

Done [here|https://github.com/adelapena/cassandra-dtest/blob/12245/materialized_views_test.py#L1020]
and also [before restarting|https://github.com/adelapena/cassandra-dtest/blob/12245/materialized_views_test.py#L991-L994].

bq. I created a [PR|https://github.com/adelapena/cassandra/pull/1] on your branch with the
above suggestions.

Merged, thanks!

bq. Even though the patch is looking good and has some dtest coverage, I feel that we are
still missing some unit testing to have confidence this is working as desired and catch any
subtle regression, given this is critical for correct MV functioning. With that said, it would
be nice if we could test that {{ViewBuilderTask}} is correctly building a specific range and
maybe extend {{ViewTest.testViewBuilderResume}} to test view building/resume with different
numbers of concurrent view builders. What do you think?

Makes sense. The new test for {{ViewBuilderTask}} is [here|https://github.com/adelapena/cassandra/blob/d505d8014524b72422ba1eee036494bda39f53f0/test/unit/org/apache/cassandra/db/view/ViewBuilderTaskTest.java],
and I've extended [{{ViewTest.testViewBuilderResume}}|https://github.com/adelapena/cassandra/blob/d505d8014524b72422ba1eee036494bda39f53f0/test/unit/org/apache/cassandra/cql3/ViewTest.java#L1323-L1388]
to run with different numbers of concurrent view builders.

CI is running at:
||[utest|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/adelapena-12245-trunk-testall/]||[dtest|http://jenkins-cassandra.datastax.lan/view/Dev/view/adelapena/job/adelapena-12245-trunk-dtest/]||


> initial view build can be parallel
> ----------------------------------
>
>                 Key: CASSANDRA-12245
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12245
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Materialized Views
>            Reporter: Tom van der Woerdt
>            Assignee: Andrés de la Peña
>             Fix For: 4.x
>
>
> On a node with lots of data (~3TB) building a materialized view takes several weeks,
which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized:
>  * do vnodes in parallel, instead of going through the entire range in one thread
>  * just iterate through sstables, not worrying about duplicates, and include the timestamp
of the original write in the MV mutation. since this doesn't exclude duplicates it does increase
the amount of work and could temporarily surface ghost rows (yikes) but I guess that's why
they call it eventual consistency. doing it this way can avoid holding references to all tables
on disk, allows parallelization, and removes the need to check other sstables for existing
data. this is essentially the 'do a full repair' path



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

