cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-12245) initial view build can be parallel
Date Tue, 15 Aug 2017 04:39:00 GMT


Paulo Motta commented on CASSANDRA-12245:

Thanks for the patch and sorry for the delay! I had an initial look at the patch; overall it looks
good. I have the following comments/questions/remarks:

bq. The newly created ViewBuilderController reads the local token ranges, splits them to satisfy
a concurrency factor, and runs a ViewBuilder for each of them.

It would be nice to reuse the {{Splitter}} methods here if possible, so we can reuse their tests;
or, if that's not straightforward, maybe put the new methods on {{Splitter}} and add some tests
to make sure they work correctly.
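
To illustrate the kind of splitting I mean, here is a minimal, self-contained sketch over a simplified numeric token space (the {{TokenRange}}/{{split}} names are just for the example, not the actual {{Splitter}} API):

{code:java}
import java.util.ArrayList;
import java.util.List;

public final class RangeSplitExample
{
    public static final class TokenRange
    {
        public final long left, right; // (left, right] over a simplified numeric token space

        public TokenRange(long left, long right)
        {
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString()
        {
            return "(" + left + ", " + right + "]";
        }
    }

    /**
     * Split each local range into enough evenly sized parts so that the total number
     * of sub-ranges is at least the concurrency factor; ranges are never merged.
     */
    public static List<TokenRange> split(List<TokenRange> localRanges, int concurrency)
    {
        int partsPerRange = (int) Math.ceil((double) concurrency / localRanges.size());
        List<TokenRange> result = new ArrayList<>();
        for (TokenRange range : localRanges)
        {
            long width = range.right - range.left;
            if (partsPerRange <= 1 || width <= 1)
            {
                result.add(range); // nothing to split
                continue;
            }
            long step = Math.max(1, width / partsPerRange);
            for (long start = range.left; start < range.right; start += step)
                result.add(new TokenRange(start, Math.min(range.right, start + step)));
        }
        return result;
    }

    public static void main(String[] args)
    {
        // two local ranges and a concurrency factor of 4 -> each range is split in two
        List<TokenRange> local = List.of(new TokenRange(0, 100), new TokenRange(200, 260));
        split(local, 4).forEach(System.out::println);
    }
}
{code}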

bq.  When ViewBuilderController receives the finalization signal of the last ViewBuilder,
it double-checks if there are new local ranges that weren't considered at the beginning of
the build. If there are new ranges, new {{ViewBuilder}}s are created for them.

This will not work if the range movement which created the new local range finishes after
the view has finished building. This problem exists currently and is unrelated to the view
build process itself; it is more related to range movement completion, which should ensure
the views are properly built before the operation finishes. I created CASSANDRA-13762 to
handle this properly.

bq. Given that we have a ViewBuilder per local range, the key of the table system.views_builds_in_progress
is modified to include the bounds of the token range. So, we will have an entry in the table
per each ViewBuilder. The number of covered keys per range is also recorded in the table.

We can probably remove the generation field from the builds in progress table and [remove this|].
bq. I have updated the patch to use a new separate table, system.views_builds_in_progress_v2

{{views_builds_in_progress_v2}} sounds a bit hacky, so perhaps we should call it {{system.view_builds_in_progress}}
(remove the s) and also add a NOTICE entry informing users that the previous table was replaced and
its data files can be removed.
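
Something along these lines is what I'd expect for the renamed table (just a sketch; apart from the range bounds and the keys-built count, the column names are assumptions, not the patch):

{code:java}
public final class ViewBuildsInProgressSchemaSketch
{
    // Hypothetical schema sketch: the key includes the bounds of the token range,
    // so there is one row per ViewBuilder, and the number of keys built is recorded.
    public static final String VIEW_BUILDS_IN_PROGRESS_CQL =
        "CREATE TABLE system.view_builds_in_progress ("
        + "keyspace_name text,"
        + "view_name text,"
        + "start_token varchar,"
        + "end_token varchar,"
        + "last_token varchar,"   // resume point within the sub-range
        + "keys_built bigint,"
        + "PRIMARY KEY ((keyspace_name), view_name, start_token, end_token))";
}
{code}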

bq. The downside is that pending view builds will be restarted during an upgrade to 4.x, which
seems reasonable to me.

Sounds reasonable to me too.

bq. ViewBuilder and ViewBuilderController are probably not the best names. Maybe we could
rename ViewBuilder to something like ViewBuilderTask or ViewBuilderForRange, and rename ViewBuilderController
to ViewBuilder.

{{ViewBuilder}} and {{ViewBuilderTask}} LGTM

bq. The concurrency factor is based on conf.concurrent_compactors because the views are built
on the CompactionManager, but we may be interested in a different value.

I'm a bit concerned about starving the compaction executor for a long period during the view build
of large base tables, so we should probably have another option like {{concurrent_view_builders}}
with a conservative default, and perhaps control the concurrency at the {{ViewBuilderController}}.
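
Roughly what I have in mind (the {{concurrent_view_builders}} option is the proposal above; the executor wiring below is only a sketch, not the patch):

{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class ViewBuildConcurrencySketch
{
    private final ExecutorService executor;

    public ViewBuildConcurrencySketch(int concurrentViewBuilders)
    {
        // size a dedicated executor from its own setting instead of concurrent_compactors;
        // the conservative default would come from cassandra.yaml, e.g. concurrent_view_builders: 1
        this.executor = Executors.newFixedThreadPool(Math.max(1, concurrentViewBuilders));
    }

    public void build(List<Runnable> perRangeBuildTasks)
    {
        // at most concurrentViewBuilders tasks run at once; the rest wait in the queue,
        // so building views for a large base table cannot monopolize the compaction executor
        perRangeBuildTasks.forEach(executor::submit);
    }
}
{code}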

bq. The patch tries to evenly split the token ranges into the minimum number of parts needed to satisfy
the concurrency factor, and it never merges ranges. So, with the default 256 virtual nodes
(and a smaller concurrency factor) we create 256 build tasks. We might be interested in different
planning. If we want the number of tasks to be smaller than the number of local ranges we should
modify the ViewBuilder task to be responsible for several ranges, although that will complicate
the status tracking.

I think this is good to start with; we can improve the planning later if necessary. I don't
think there is much gain from merging ranges to have fewer tasks.

bq. Probably there is a better way of implementing ViewBuilder.getCompactionInfo. The patch
uses keysBuilt/ColumnFamilyStore.estimatedKeysForRange to estimate the completion, which could
lead to a task completion status over 100%, depending on the estimation.

How about using {{prevToken.size(range.right)}} (introduced by CASSANDRA-7032)? This will not be
available for BytesToken (used by ByteOrderedPartitioner), but that is rarely used, so we could
maybe fall back to the current imprecise calculation in that case.
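
With a token-size based calculation the result is bounded by construction. A self-contained sketch of the idea, with the {{Token.size()}} ring fraction stood in by plain doubles and made-up numbers:

{code:java}
public final class ViewBuildProgressSketch
{
    /**
     * @param rangeSize fraction of the ring covered by the whole sub-range, i.e. range.left.size(range.right)
     * @param builtSize fraction of the ring already covered, i.e. range.left.size(lastBuiltToken)
     * @return completion in [0, 1], so it can never report more than 100%
     */
    public static double completion(double rangeSize, double builtSize)
    {
        if (rangeSize <= 0)
            return 1.0; // degenerate range: treat as done
        return Math.min(1.0, builtSize / rangeSize);
    }

    public static void main(String[] args)
    {
        // a sub-range covering 2% of the ring, with 1.5% of the ring already built -> 0.75
        System.out.println(completion(0.02, 0.015));
    }
}
{code}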

Other comments:

* Avoid submitting the view builder when the view is already built, instead of checking this on the ViewBuilder
* ViewBuilder seems to be reimplementing some of the logic of {{PartitionRangeReadCommand}},
so I wonder if we should take this chance to simplify and use that instead of manually constructing
the commands via ReducingKeyIterator and multiple {{SinglePartitionReadCommand}}s? We can
totally do this in another ticket if you prefer.
* Perform view marking on ViewBuilderController [instead of ViewBuilder|]
* Updating the view built status [at every key|]
is a bit inefficient and unnecessary, so perhaps we should update it every 1000 keys
or so (see the sketch after this list).
* It would be nice to update the {{interrupt_build_process_test}} to stop halfway through the
build (instead of at the start of the build) and verify that it's being resumed correctly with
the new changes.
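
For the periodic status update mentioned above, something like this is what I mean (the 1000-key interval and the {{updateBuiltStatus}} callback are just placeholders for the example):

{code:java}
import java.util.function.LongConsumer;

public final class PeriodicProgressUpdater
{
    private static final long UPDATE_INTERVAL_KEYS = 1000;

    private final LongConsumer updateBuiltStatus; // e.g. persists keysBuilt and the last token to the system table
    private long keysBuilt;

    public PeriodicProgressUpdater(LongConsumer updateBuiltStatus)
    {
        this.updateBuiltStatus = updateBuiltStatus;
    }

    public void onKeyBuilt()
    {
        keysBuilt++;
        // only touch the system table every UPDATE_INTERVAL_KEYS keys instead of on every key
        if (keysBuilt % UPDATE_INTERVAL_KEYS == 0)
            updateBuiltStatus.accept(keysBuilt);
    }

    public void onRangeFinished()
    {
        // always flush the final count when the sub-range completes
        updateBuiltStatus.accept(keysBuilt);
    }
}
{code}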

> initial view build can be parallel
> ----------------------------------
>                 Key: CASSANDRA-12245
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Materialized Views
>            Reporter: Tom van der Woerdt
>            Assignee: Andrés de la Peña
>             Fix For: 4.x
> On a node with lots of data (~3TB) building a materialized view takes several weeks,
which is not ideal. It's doing this in a single thread.
> There are several potential ways this can be optimized:
>  * do vnodes in parallel, instead of going through the entire range in one thread
>  * just iterate through sstables, not worrying about duplicates, and include the timestamp
of the original write in the MV mutation. Since this doesn't exclude duplicates it does increase
the amount of work and could temporarily surface ghost rows (yikes), but I guess that's why
they call it eventual consistency. Doing it this way can avoid holding references to all tables
on disk, allows parallelization, and removes the need to check other sstables for existing
data. This is essentially the 'do a full repair' path
