db-derby-dev mailing list archives

From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: bulk insert
Date Fri, 27 Aug 2010 18:51:25 GMT
I would vote -1 on this, if the proposal is to allow unlogged inserts
into non-empty tables.  I do not want to add syntax that basically
allows users to silently corrupt their db.

Bulk insert has been
carefully restricted to cases where we can guarantee the consistency of the
database.  There is currently no risk from disabling logging,
because logging isn't really disabled.  We don't log the actual inserts,
but we log enough (and carefully do the right data syncing) so that if
the transaction does not complete, recovery
can do the right thing in backing out the transaction.  The usual case
is an empty table, where the logging system knows that it can easily
get back to the empty state using DDL rather than backing out each
individual insert - this is the magic of using a new conglomerate for
the bulk operation.
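For context, the supported way to hit this code path today is the import procedures. A sketch (the schema, table, and file names here are made up):

```sql
-- Replace mode (last argument = 1): Derby loads into a NEW conglomerate
-- and swaps it in at commit, so a failed transaction is backed out by
-- dropping the new conglomerate rather than undoing each row.
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE(
    NULL,         -- schema name (NULL = current schema)
    'SALES',      -- target table (hypothetical)
    'sales.dat',  -- input file (hypothetical)
    NULL, NULL,   -- default column and character delimiters
    NULL,         -- default codeset
    1);           -- 1 = replace existing rows, 0 = append
```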

Allowing no logging, as is suggested, opens us up to all sorts of
corruption which
may not become visible until long after the offending statement has
been executed.  Even a simple insert followed by a crashed database can easily
corrupt an index, leaving the index with an entry on disk that
the base table does not have, or the opposite - all depending on what happens to
the LRU in the cache.  Even if you tell users that they have to recover
themselves, there is no way to help them even know whether they have to.

Rick Hillegas wrote:
> Bulk insert is a special mode which causes Derby to repopulate a table 
> without logging the rows. Under the hood, bulk insert mode causes the 
> interpreter to call TransactionController.recreateAndLoadConglomerate(). 
> The import procedures use this special mode to get better performance. 
> With logging disabled there is always the risk that something could go 
> wrong and the target table could end up corrupted. In the case of such a 
> corruption, the user has to drop and recreate the table. I don't know if 
> there are any wormholes in this area. I'm not aware of any bugs which 
> have been logged because users could not return to an uncorrupted state 
> after a failed import.
I am not aware of any situation where bulk insert results in this kind
of corruption in the current system.  If the transaction commits, then
all the unlogged data has been synced to disk and the log records for
the DDL to switch the old and new conglomerates have been logged.  If
the transaction does not commit, all the unlogged activity has gone to
NEW conglomerates, and log records for the creation of all those
conglomerates have been logged.  The system has log records that allow it to
back out all the changes to the NEW conglomerates, which it does
basically by dropping the NEW conglomerates on backout of the transaction.
> We get frequent requests to allow users to bulk import data without 
> having to use the import procedures. Users want the performance boost of 
> unlogged inserts without the performance drag of having to first dump 
> the data to a file on the local disk. For instance, users want to be 
> able to bulk import from a table function which siphons data out of an 
> external data source.
I do think that if you do an insert-as-select into an empty table you will
get the same behavior as a bulk insert.  This is just from memory, and may
not be correct.  This is exactly what bulkInsert is meant to take advantage of.
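A sketch of the kind of statement Rick describes, assuming the parser restriction were lifted (the table name, table function, and whether this actually triggers the bulk path are all unverified assumptions here):

```sql
-- Hypothetical: today the parser rejects these properties on a plain
-- INSERT. The proposal would allow something like this, pulling rows
-- from a table function instead of dumping to a flat file first.
INSERT INTO SALES --DERBY-PROPERTIES insertMode=bulkInsert
SELECT * FROM TABLE (externalSales()) AS T;
```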
> At first blush, this seems like an easy feature to implement. We just 
> have to remove a small bit of logic in the parser which prevents 
> ordinary insert statements from including one of the bulk insert 
> directives. Those directives are just Derby properties:
> --DERBY-PROPERTIES insertMode=bulkInsert
> --DERBY-PROPERTIES insertMode=replace
What exactly do you want these to do?  Should bulkInsert into a
non-empty table do logging?  (I am not sure whether it does in the current system.)

I believe replace means that if the insert statement succeeds, you can delete
all the
existing rows of the table and just leave the new ones.  Is this
something useful for customers?  The mode is very useful for the internal
temp table implementation, as it matches the expected behavior for some
types of temp tables.
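Under that reading of replace, the proposed syntax would look something like this (again hypothetical, and the atomic-swap semantics are my assumption about the intent):

```sql
-- Hypothetical: 'replace' would substitute the table's contents --
-- on a successful commit only the newly inserted rows remain; on
-- failure the old conglomerate is untouched.
INSERT INTO SALES --DERBY-PROPERTIES insertMode=replace
SELECT * FROM STAGING_SALES;
```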
> I would like to make this change. That is, I would like to
> o Allow users to specify the bulkInsert and replace properties on any 
> insert statement.
> o Document these properties with a warning that the operations are not 
> logged and the user is responsible for recovering from errors.
> This change will increase the risk that users will have to manually 
> recover from failed, corrupting inserts.
> I am, however, concerned that bulk insert has been deliberately disabled 
> except for its narrow use by the import procedures. Does anyone know why 
> this was done? Are there other risks which I should be aware of? Does 
> anyone object to broadening this use of Derby's bulk insert feature?
> Thanks,
> -Rick
