db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Larry Yogman" <larry.yog...@pervasive.com>
Subject Parallel loading, truncation, and booleans
Date Fri, 02 Mar 2007 17:59:56 GMT
==============
Use case description:
==============

I'm bulk loading derby tables from in-memory data.  I'm looking for maximum performance on
multi-threaded hardware.  My columns have many data types, with the most common being boolean
and string.  I'm willing to truncate data if necessary, and I do not need ACID properties.
 To support later queries, I need the tables to be indexed by a shared, application-generated,
64-bit primary key.

I'm using version 10.2.2.0.  I'm having some problems with performance, data truncation, and
boolean data.

========================
Problem 1. Disappointing performance
========================

I create all my tables, then start to insert using prepared sql statements like this: INSERT
INTO DATA1(DP_ROWID,F1,F2,F3) VALUES (?,?,?,?).  I use multiple threads per table, trying
to keep all available CPU cores busy.  Each thread calls executeBatch once for every 1000
inserts.  The DP_ROWID field is the application-generated key (type BIGINT).  I create the
index on DP_ROWID after the loading is finished.

On a 4-Opteron box, each CPU dual-core, my application spends over 60% of its time loading
derby tables, about 20% of its time creating derby indexes, and less than 20% of its time
doing other work.   I would like to improve the derby performance as much as possible.  Some
questions:

My derby.properties file includes: 
derby.system.durability=test  
derby.storage.pageCacheSize=10000
Any additions/changes recommended for my derby.properties settings?

Should the interval of 1000 rows between calls to executeBatch be increased?  What's a good
value? 

Am I hurting myself by using multiple threads per table?  Could this be counter productive
due to locking at the level of the whole table?  An alternative way to parallelize is to write
a larger number of smaller tables, but I would prefer not to do this as it will complicate
later queries.

Are there any known issues with creating indexes in parallel?  I have not yet parallelized
this part of the work, but once the load is faster the indexing will be a larger fraction
of the remaining time.

=================
Problem 2. Data truncation.
=================

The data inserted includes java.lang.String objects of unknown varying length and java.lang.BigDecimal
objects of unknown varying length and unknown, potentially varying precision.  The corresponding
column types are currently VARCHAR(100) and DECIMAL(31,5), but sometimes I still get truncation
errors.  Ideal would be a way to preserve the values exactly as they are, without truncation.
 Can this be done using appropriate declarations of column data types when creating the table?
If so, what's the performance impact?

For my use case, data truncation is acceptable, but failing with an error about data truncation
is not, so second best would be a way to truncate as needed. Hoping for a propery to configure
this behavior for the whole database or system.

If the approachs above do not work, then I would appreciate advice on the relative efficiency
of:

A) Changing my INSERT statements to use SQL casts  

B) Using the existing INSERT statements, and truncating values in Java code before calling
PreparedStatement.setX

================
Problem 3. Boolean data
================

I have a lot of boolean columns.  I assume that using the most space-efficient representation
will give maximum performance when loading.  However, I also need to be able to query for
true vs. false.

I started out trying to declare BOOLEAN columns, but this data type is not supported.  I'm
currently using one SMALLINT column per boolean field, and querying for 0 vs 1.  Should I
switch to CHAR(1) or CHAR(1) FOR BIT DATA?  What would queries look like?  I also considered
packing 16 bits into each SMALLINT column, but there do not seem to be any shift or mask operations
available in SQL, so I don't see how to query for true vs. false.

Mime
View raw message