db-derby-dev mailing list archives

From Rick Hillegas <rick.hille...@oracle.com>
Subject Re: bulk insert
Date Fri, 27 Aug 2010 19:59:45 GMT
Hi Mike,

Thanks for the quick response. Some comments inline...

Mike Matrigali wrote:
> I would vote -1 on this, if the proposal is to allow unlogged inserts
> into non-empty tables.  I do not want to add syntax that basically
> allows users to silently corrupt their db.
I poorly described what I meant. I am not suggesting that we invent a 
new mechanism for bulk insert. I'm merely suggesting that we re-use the 
existing bulk insert mechanism used by the import procedures. When I 
said that logging would be turned off, I only meant that it would be 
turned off in the way that it is for the import procedures. This is my 
understanding of how that existing logic works:

o The import procedure cooks up an INSERT statement which specifies 
"insertMode=bulkInsert". At execution time, if that INSERT statement 
specifies "insertMode=bulkInsert" AND the table is empty, then a new 
conglomerate is created on the side and the inserts go into that new 
conglomerate. As you note, the conglomerate creation is logged and the 
old conglomerate is replaced with the new conglomerate only if the 
insert succeeds. (A rough sketch of this flow follows the next item.)

o If the table is not empty at execution time, then the bulkInsert 
directive is ignored and a new conglomerate is not created. In this 
case, the inserts go into the old conglomerate and they are logged.
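
For concreteness, here is a rough sketch of that flow. The procedure 
call is the real system procedure; the INSERT underneath is only my 
approximation of the kind of statement the import code builds internally 
(the table name T, the file name, and the "importSource" select source 
are placeholders, not the real internal source):

-- user-visible entry point: the import system procedure
call SYSCS_UTIL.SYSCS_IMPORT_TABLE( null, 'T', 'data.dat', null, null, null, 0 );

-- sketch of the kind of INSERT it cooks up internally; the property
-- directive is what requests the side conglomerate when T is empty
insert into t --DERBY-PROPERTIES insertMode=bulkInsert
select * from importSource;
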
> Bulk insert has been
> carefully used only where we can guarantee the consistency of the
> database.  There is currently no risk, even though logging is disabled,
> because logging isn't really disabled.  We don't log the actual inserts
> but we log enough (and carefully do correct data syncing) so that if the
> transaction does not complete, recovery
> can do the right thing in backing out the transaction.  The usual case
> is an empty table where the logging system knows that it can easily
> get back to the empty state using ddl rather than backing out each
> individual insert - this is the magic of using a new conglomerate for
> the bulk operation.
My belief is that, since we would re-use the existing import mechanism, 
we would not create any cases which are not already possible today. 
There would just be more inserts which would use this mechanism. If this 
can lead to data corruption, then the existing import mechanism is 
already subject to the same kind of corruption.
> Allowing no logging as is suggested opens us up to all sorts of 
> corruptions which
> may not become visible until long after the offending statement has
> been executed.  Just a simple insert and a crashed database can easily
> corrupt an index by resulting in the index having an entry on disk where
> the base table does not, or the opposite - all just by what happens to
> LRU in the cache.  Even if you tell users that they have to recover
> themselves, there is just no way to help them even know if they have to.
> Rick Hillegas wrote:
>> Bulk insert is a special mode which causes Derby to repopulate a 
>> table without logging the rows. Under the hood, bulk insert mode 
>> causes the interpreter to call 
>> TransactionController.recreateAndLoadConglomerate(). The import 
>> procedures use this special mode to get better performance. With 
>> logging disabled there is always the risk that something could go 
>> wrong and the target table could end up corrupted. In the case of 
>> such a corruption, the user has to drop and recreate the table. I 
>> don't know if there are any wormholes in this area. I'm not aware of 
>> any bugs which have been logged because users could not return to an 
>> uncorrupted state after a failed import.
> I am not aware of any situation where bulk insert results in this kind 
> of corruption in the current system.  If the transaction commits then
> all the unlogged data has been synced to disk and the log records for
> the ddl to switch the old and new conglomerate have been logged.  If
> the transaction does not commit all the unlogged activity has gone to
> NEW conglomerates and log records for the creation of all those 
> conglomerates have been logged.  The system has logs that allow it to
> back out all the changes to the NEW conglomerates, which it does 
> basically by dropping the NEW conglomerates on backout of the 
> transaction.
Great. I was hoping to hear that.
>> We get frequent requests to allow users to bulk import data without 
>> having to use the import procedures. Users want the performance boost 
>> of unlogged inserts without the performance drag of having to first 
>> dump the data to a file on the local disk. For instance, users want 
>> to be able to bulk import from a table function which siphons data 
>> out of an external data source.
> I do think that if you do insert as select into an empty table you will
> get the same behavior as a bulk insert.  This is just from memory, and may
> not be correct.  This is what bulkInsert is meant to take advantage of.
You only get the new conglomerate (and the unlogged inserts into it) if 
you specify the "insertMode=bulkInsert" property. I believe that if the 
system behaved as you described, then you would see different 
conglomerates for the target table before and after the insert. The 
conglomerates are unchanged when I run the following experiment:

connect 'jdbc:derby:memory:dummy;create=true';

create table t( a int );
create table s( a int );
insert into s( a ) values ( 1 );

-- note T's conglomerate number before the insert
select c.conglomeratenumber, c.conglomeratename
from sys.systables t, sys.sysconglomerates c
where c.tableid = t.tableid
and t.tablename = 'T';

-- an ordinary insert-as-select into the empty table T
insert into t select * from s;

select * from t;

-- the conglomerate number is unchanged after the insert
select c.conglomeratenumber, c.conglomeratename
from sys.systables t, sys.sysconglomerates c
where c.tableid = t.tableid
and t.tablename = 'T';
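
By contrast, and assuming I have the mechanism right, driving the same 
check through the import procedure against an empty table should show a 
different conglomerate number afterwards, because the bulk insert path 
builds a new conglomerate on the side. A sketch (the data file name is a 
placeholder):

-- a fresh, empty table for the contrast
create table t2( a int );

select c.conglomeratenumber, c.conglomeratename
from sys.systables t, sys.sysconglomerates c
where c.tableid = t.tableid
and t.tablename = 'T2';

-- import from a flat file (placeholder name); last argument 0 = non-replace
call SYSCS_UTIL.SYSCS_IMPORT_TABLE( null, 'T2', 'data.dat', null, null, null, 0 );

-- if the bulk insert path kicked in, the conglomerate number should now differ
select c.conglomeratenumber, c.conglomeratename
from sys.systables t, sys.sysconglomerates c
where c.tableid = t.tableid
and t.tablename = 'T2';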

>> At first blush, this seems like an easy feature to implement. We just 
>> have to remove a small bit of logic in the parser which prevents 
>> ordinary insert statements from including one of the bulk insert 
>> directives. Those directives are just Derby properties:
>> --DERBY-PROPERTIES insertMode=bulkInsert
>> --DERBY-PROPERTIES insertMode=replace
> What exactly do you want these to do?  Should bulkInsert into a 
> non-empty table do logging (I am not sure if it does in the current 
> system)?
If the table is not empty, you will get the behavior you see today if 
you import into a non-empty table: the "insertMode=bulkInsert" directive 
will be ignored and each inserted row will be logged.
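
If the parser restriction were lifted, I imagine the user-visible 
statement would look something like the sketch below. The table function 
name externalFeed is invented for illustration; it stands for a function 
declared elsewhere that reads the external data source:

-- hypothetical: an ordinary INSERT carrying the bulk insert directive,
-- selecting from a table function that siphons rows from an external source
insert into t --DERBY-PROPERTIES insertMode=bulkInsert
select f.* from table( externalFeed() ) f;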

> I believe replace means that if the insert statement succeeds you can 
> delete all the
> existing rows of the table and just leave the new ones.  Is this 
> something useful for customers?  The mode is very useful for internal
> temp table implementation as it matches the expected behavior for some
> types of temp tables.
There is a flag on the import command which requests this replacement 
behavior. I can see that this would be useful for applications which 
need to bulk import a slug of data every day and which are not 
interested in yesterday's results.
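
For reference, the existing replacement behavior is requested through 
the final argument of the import procedure; a non-zero value replaces 
the table's current rows with the imported ones (the file name below is 
a placeholder):

-- last argument non-zero: replace the existing rows of T with the imported data
call SYSCS_UTIL.SYSCS_IMPORT_TABLE( null, 'T', 'daily_feed.dat', null, null, null, 1 );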


>> I would like to make this change. That is, I would like to
>> o Allow users to specify the bulkInsert and replace properties on any 
>> insert statement.
>> o Document these properties with a warning that the operations are 
>> not logged and the user is responsible for recovering from errors.
>> This change will increase the risk that users will have to manually 
>> recover from failed, corrupting inserts.
>> I am, however, concerned that bulk insert has been deliberately 
>> disabled except for its narrow use by the import procedures. Does 
>> anyone know why this was done? Are there other risks which I should 
>> be aware of? Does anyone object to broadening this use of Derby's 
>> bulk insert feature?
>> Thanks,
>> -Rick
