From: Matt Doran [mailto:firstname.lastname@example.org]
Sent: Monday, December 03, 2007
Subject: Re: FW: Advice on *very*
badly performing query
Michael Segel wrote:
The short simple
answer… You get what you pay for.
The longer answer… Query
optimization is a black art. Cloudscape was designed as a lightweight no frills
Now you Cloudscape morphed in to Derby
and JavaDB. But you lose the input from the folks at IBM who handle Query
But I'm more than happy to work with the constraints
of Derby, if
only I could understand them. And that's the help I was looking for
I have to run to a customer site, but
using one of your examples… you noticed that the query performance
changed when you had the field in the select columns as well as the where
clause, but you didn’t when you had the field just in the where clause.
So keep it in the selected fields. You could also try and change the order of
the tables you’re joining.
I didn't change the where clause, just changing the
select fields causes the dramatic query plan change. I have a
feeling it might be the fact that I'm selecting an attribute from the 5th join
table ... but I'd like a better understanding of what's triggering the change
so I can avoid it if possible.
And you may want to reflect your 5 table join. Depending on the database
and its tuning. Joining more than 3-4 tables can have a drastic negative impact
on its performance.
I'm not sure what you mean to "reflect your 5
The fundamental issue here is that in this poor performing case, derby is not
looking at the index on the very large table that would immediately reduce the
dataset. For whatever reason the optimizer is making a the worst
possible case decision.
And if speed really is important
look at Informix (IDS 11) now offered by IBM.
Unfortunately as an off the shelf Java application
that runs on Windows, Mac and Linux ... we really need a simple embedded DB
that we can ship as the default. Unfortunately Derby's query optimizer let's it down badly
We use Apache Derby in our commercial application, PaperCut NG.
It's proven to be very reliable, however we occasionally get reports of very
bad performance in some areas. We haven't had the time to investigate
them fully previously (usually upgrading to an external DB like Postgres or SQL
Server fixes the issue). This time we had a look in more detail with a
recent report, and we've found some very strange performance characteristics
... and would love some advice and assistance.
We have a query that is doing inner joins to 5 tables. It's quite a
simple query, but the core table has about 300,000 rows, and where limiting the
results based on a date in that table that is indexed. Here's a summary
of my situation/findings:
- Using the latest Derby
release 10.3.1.4, with a Java 1.5 VM on Windows.
- We only have a
single WHERE clause, which is on the indexed date field is restricting the
data such that no data is returned. e.g. log_date > (latest log
date). So derby should quickly detect there is litte/no data to
- Running the original
query takes 22 minutes running 100% CPU.
- Running a count(*)
for the same query is quick (< 1 sec).
- Removing the ORDER
BY and changing the select list to just include a single field from each
table and it still takes 22 minutes.
- Changing the select
list to retrieve only a single field from 2 of the table and it still
takes 22 minutes (I have a log of the query and the runtime stats for this
- Changing the select
list to a single field from 1 of the tables makes the query run fast -
less than a second. (I have a log of the query and the runtime stats for
this attached "derby-fast.log").
- Running the original
query on the same dataset in PostgreSQL or SQL Server is very fast (less
than a second). This is why we often recommend customers upsize to
Postgres or SQL Server.
- Also the SQL is
generated via Hibernate ORM, so we have some limitations in how we can
modify the SQL.
>From the query plan it seems that seems that it stops using the date index on the
"tbl_printer_usage_log" log table, and changes from Hash joins to
Nested Loop joins. On a large table like this when providing a where
clause that on a field that is indexed .... we have to ensure that derby uses
If I increase the pageCacheSize to 100,000 pages, it reduces the time of the
query to about 2-3 minutes, but it's still very slow compared to when the
correct index is used.
Can anyone please help me understand the following:
- Why does the query
plan change dramatically, just by changing the fields that are retrieved?
- Why is derby
avoiding the most obvious index? The date field in the 300,000 row
table (the date field is the only field in our where clause).
- Is there anyway to
avoid this behavior?
If we can understand what's causing this, we'll be
able to make a much more effective use of Derby.
At the moment, on customers with large datasets, we currently just recommend
they "upsize" to Postgres or SQL Server and the problem goes
away. However, we'd much prefer to fix this and have our Derby
database behave better.
I'd be happy to provide the derby database that exhibits these problems if
someone would like to see what's going on. The database is from a
customer, so I don't want to post it publicly, but if you send me an email off-list
I'd be happy to provide it.
PaperCut Software International Pty. Ltd.
Phone: +61 (3) 9807 5767