db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rick Hillegas <rick.hille...@oracle.com>
Subject Re: is there some way to support bulk insert without new syntax, in a zero admin way?
Date Tue, 31 Aug 2010 19:46:44 GMT
Thanks for starting this new thread, Mike. For people who may have 
missed the earlier conversation about bulk insert, this is a 
continuation of a discussion started here: 

I think that I would be satisfied with something less ambitious: If the 
source stream for the INSERT involves a table function, then always 
attempt a bulk insert--naturally, the bulk insert optimization would be 
applied only if the target table was empty. This wouldn't give us 
simple, standard syntax for the replace functionality enjoyed by the 
import procedures, but the incremental improvement would be significant.

More comments inline...

Mike Matrigali wrote:
> bulk insert used by import is actually just a hint to the
> underlying system to use a different query execution technique for the
> inserts.
> Rather than add syntax and force users to direct the system it
> would be nice as a zero admin db if the system could figure
> out itself whether or not to do the work, when doing an insert
> that results in more than one row.
> The cost of the bulk insert execution is:
> o it has to do a scan of the table to determine if the table
>   is empty.  If it is not empty then none of the bulk insert
>   optimization is possible, so none of the following work is
>   done.
Just to be clear here, we don't do a full table scan. We just read one row.
> o It has to create a new table and new indexes for each of the
>   existing indexes.
> o it has to drop the old table and indexes on completion.
> o it has to do system catalog updates to reflect the swapping of
>   new table and indexes for old.
> o it needs to force all datapages of the new table and indexes
>   to disk and wait for all the I/O's to finish before returning.
> o it benefits from not logging the actual data of all the inserts
>   done.
> If handling this in the optimizer, one could easily estimate if
> there are rows in the table or not.  And only build a bulk insert
> plan if there are none.
> Also there should be some threshold on the number of rows being
> inserted before choosing a bulk insert plan.  One really only
> wants to do this if there is a lot of data being inserted into
> an empty table.  In the case of insert ... as select from ...
> the optimizer definitely has an estimate for the number of rows
> being inserted.  I am not sure what state this estimate is in
> for table functions.
The VTICosting interface provides a mechanism for declaring how many 
rows a table function returns. It's an awkard mechanism and in order to 
get the proposed scheme to work, we would have to expose the magic 
threshhold which triggers the optimizer to select a bulk insert 
strategy. I'm not keen on exposing optimizer magic in Derby's public api.

I think it's important to note a significant characteristic shared by 
external data sources, whether they are files we import or table 
functions which drive inserts: the Derby optimizer has limited 
visibility into external data sources and heuristics in this area are 
likely to be crude.
> If the system is not doing so already any time you do a create table 
> as select we could be using the "bulk" optimization.   I don't know if
> this is the case.  In this case one can avoid the extra system catalog 
> overhead, and would argue that it is reasonable to assume a large amount
> of data is being inserted so that "bulk" optimization is reasonable.
Note that right now the CREATE TABLE AS SELECT statement requires that 
you specify WITH NO DATA. That means that all you can do with this 
statement is create the shape of an empty table. You cannot use this 
statement to populate a table. This part of your proposal will be a 
welcome improvement, however, when we allow this statement to populate 


View raw message