cassandra-commits mailing list archives

From "Jim Zamata (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-5741) Provide a way to disable automatic index rebuilds during bulk loading
Date Fri, 14 Mar 2014 15:54:52 GMT


Jim Zamata commented on CASSANDRA-5741:

One example comes from Oracle.  Indexes can be marked "unusable" so they will not be updated
during bulk operations.  Then another ALTER INDEX statement is issued afterward to rebuild
them.

SQL Server has an ALTER INDEX ... DISABLE statement that can be used to do the same thing.
Afterward, ALTER INDEX ... REBUILD must be called to rebuild the index(es).

The advantage of this approach over simply dropping and re-adding, which is what we do right
now, is that the index metadata is preserved.  Admittedly this may be the only advantage.
Since the indexes must be rebuilt afterward, there would be no speed advantage over dropping
and re-adding.  The client just does not have to store the index metadata and use it to recreate
the indexes.
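
The bookkeeping difference between the two approaches can be sketched in Python against a toy
in-memory catalog.  Everything here (Catalog, IndexDef, and the method and function names) is
hypothetical and is not Cassandra, Oracle, or SQL Server API.  It only illustrates that
drop-and-re-add makes the client hold the index definitions, while disable-and-rebuild leaves
them on the server; both paths end in a full rebuild.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IndexDef:
    """Hypothetical index metadata: a name, the indexed column, and options."""
    name: str
    column: str
    options: tuple = ()


@dataclass
class Catalog:
    """Toy stand-in for server-side schema; not a real database API."""
    indexes: dict = field(default_factory=dict)   # name -> IndexDef
    disabled: set = field(default_factory=set)    # names of disabled indexes

    def create_index(self, definition):
        self.indexes[definition.name] = definition

    def drop_index(self, name):
        # Dropping discards the definition; the caller must keep its own copy.
        return self.indexes.pop(name)

    def disable_index(self, name):
        if name not in self.indexes:
            raise KeyError(name)
        self.disabled.add(name)

    def rebuild_index(self, name):
        # The definition is still in the catalog, so no client input is needed.
        if name not in self.indexes:
            raise KeyError(name)
        self.disabled.discard(name)


def bulk_load_with_drop(catalog, load):
    # The client saves every definition before dropping it.
    saved = [catalog.drop_index(n) for n in list(catalog.indexes)]
    load()
    for definition in saved:
        catalog.create_index(definition)  # recreated from the client's copy


def bulk_load_with_disable(catalog, load):
    names = list(catalog.indexes)
    for n in names:
        catalog.disable_index(n)
    load()
    for n in names:
        catalog.rebuild_index(n)  # rebuilt from the server's copy
```

In this sketch the only thing bulk_load_with_disable saves the caller is the `saved` list.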

Clearly a disadvantage of this approach is that when you disable an index this way, or mark
it "unusable", the server must be sure not to use it, even though it still exists, since it
is implicitly inconsistent.  That requires extra state information and extra logic, beyond
that required to skip the index updates.  Depending on how much complexity that adds to the
update and query logic, the added convenience might not be worth it.
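
A minimal Python sketch of that extra state and logic, using a toy table with one secondary
index.  ToyTable and its methods are hypothetical and say nothing about how Cassandra would
implement this.  The single flag index_usable has to be consulted on both the update path and
the query path, and falling back to a full scan is only one possible choice; rejecting the
query would be another.

```python
from collections import defaultdict


class ToyTable:
    """Toy table with one secondary index on a single column (hypothetical)."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                   # primary key -> row dict
        self.index = defaultdict(set)    # column value -> set of primary keys
        self.index_usable = True         # the extra state the server must track

    def insert(self, key, row):
        old = self.rows.get(key)
        self.rows[key] = row
        if not self.index_usable:
            return  # update path: skip index maintenance while disabled
        if old is not None:
            self.index[old[self.indexed_column]].discard(key)
        self.index[row[self.indexed_column]].add(key)

    def disable_index(self):
        self.index_usable = False

    def rebuild_index(self):
        # Rebuild from scratch, since the old index contents are stale.
        self.index = defaultdict(set)
        for key, row in self.rows.items():
            self.index[row[self.indexed_column]].add(key)
        self.index_usable = True

    def find(self, value):
        if self.index_usable:
            return sorted(self.index.get(value, ()))
        # Query path: the index still exists but is stale, so it must not be read.
        # This sketch falls back to scanning every row.
        return sorted(k for k, r in self.rows.items()
                      if r[self.indexed_column] == value)
```

For example, after disable_index() a newly inserted row is missing from self.index, yet find()
still returns it because the stale index is bypassed until rebuild_index() runs.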

> Provide a way to disable automatic index rebuilds during bulk loading
> ---------------------------------------------------------------------
>                 Key: CASSANDRA-5741
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>    Affects Versions: 1.2.6
>            Reporter: Jim Zamata
> When using the BulkLoadOutputFormat the actual streaming of the SSTables into Cassandra
> is fast, but the index rebuilds can take several minutes. Cassandra does not send the response
> until after all of the rebuilds for a streaming session complete. This causes the tasks to
> appear to hang at 100%, since the record writer streams the files in its close method.  If
> the rebuilding process takes too long, the tasks can actually time out.
>
> Many SQL databases provide bulk insert utilities that disable index updates to allow
> large amounts of data to be added quickly.  This functionality would serve a similar purpose.
>
> An alternative might be an option that would allow the session to return once the SSTables
> had been successfully imported without waiting for the index builds to complete.  However,
> I have noticed heavy CPU loads during the index rebuilds, so bulkload performance might be
> better if this step could be deferred until after all of the data is loaded.
