db-derby-user mailing list archives

From Army <qoz...@gmail.com>
Subject Re: slow subqueries
Date Fri, 10 Nov 2006 16:53:08 GMT
Jim Newsham wrote:
> Why do queries with subqueries perform so poorly?

This is one of the places where Derby optimization could definitely use some 
improvement.  There are several known issues with the optimization of subqueries 
that can potentially lead to sub-optimal plans, including unreasonably high cost 
estimates (DERBY-1905, DERBY-1260), lack of an effective pruning algorithm 
(DERBY-1907), and a possibly insufficient "timeout" mechanism (DERBY-1906). 
There has been work in this area that will hopefully improve the situation for 10.2 
(DERBY-805, DERBY-781, DERBY-1007), but as you yourself have noticed, we have a 
long way to go.

> The following query completes almost instantly.  I suppose Derby simplifies
> the subquery internally to "band.id = 11":
> 
> ij(SAMPLEBASE)> select count(*) from time where time.id in (select distinct
> time.id from time, double_sample, band where band.id =
> double_sample.fk_band_id and double_sample.fk_time_id = time.id and band.id
> in (11));
> 
> This query "does not complete":
> 
> ij(SAMPLEBASE)> select count(*) from time where time.id in (select distinct
> time.id from time, double_sample, band where band.id =
> double_sample.fk_band_id and double_sample.fk_time_id = time.id and band.id
> in (11, 12));

Using the DDL that you posted in an earlier email, I ran these queries, and they 
both completed in less than a second.  Note, though, that I didn't have any data 
in the tables :)  That may sound like useless information to you, but it tells 
me that the query compilation is at least completing in a reasonable amount of 
time, which is a first step.

For more tips on diagnosing this kind of problem, see 
http://wiki.apache.org/db-derby/PerformanceDiagnosisTips

In your case, do you know how much time is being spent in compilation vs. 
execution?  The easiest way to tell this in ij is to prepare the query first, 
then execute it:

ij> elapsedtime on;
ij> prepare ps1 as 'select count(*) from time where time.id in (select distinct 
time.id from time, double_sample, band where band.id = double_sample.fk_band_id 
and double_sample.fk_time_id = time.id and band.id in (11))';
ELAPSED TIME = 2344 milliseconds
ij> execute ps1;
1
-----------
0

1 row selected
ELAPSED TIME = 60 milliseconds
ij> prepare ps2 as 'select count(*) from time where time.id in (select distinct 
time.id from time, double_sample, band where band.id = double_sample.fk_band_id 
and double_sample.fk_time_id = time.id and band.id in (11, 12))';
ELAPSED TIME = 301 milliseconds
ij> execute ps2;
1
-----------
0

1 row selected
ELAPSED TIME = 10 milliseconds

If you have a script or a Java program that loads data into your tables, and if 
you are willing and able to post that script/program to the list, then those who 
are following this discussion might more easily be able to reproduce the 
problem, which would help the investigation.  If you cannot or do not want to post 
such a script/program, then any description about the kind and amount of data in 
the relevant tables would be helpful, as well.  For example, what value do you 
expect "count(*)" to return in these queries?  How many rows does the subquery 
in the IN clause return?  How many rows are in the various tables?
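If a full loader is more than you want to share, even a tiny synthetic one 
would give everyone a common starting point.  A sketch (the table and column 
names are taken from your queries; the values and row counts are completely 
made up, and I'm assuming "id" is the only required column in each table):

ij> insert into band (id) values 11, 12;
ij> insert into time (id) values 1, 2, 3;
ij> insert into double_sample (fk_band_id, fk_time_id)
      values (11, 1), (11, 2), (12, 3);

Obviously you'd need far more rows than that to reproduce a performance 
problem, but something along those lines would let people poke at the plans.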

> I've been searching/reading around and trying to understand a little about
> how Derby works and how to check the statement plan.  I've read through the
> tuning document.  Unfortunately if a query doesn't complete, I can't get the
> statement plan.

Actually, I think you only need to retrieve the first row in order to get the 
statement plan.  For example, if you set "derby.language.logQueryPlan=true" and 
then run the following:

ij> get cursor c1 as 'select * from time where time.id in (select distinct 
time.id from time, double_sample, band where band.id = double_sample.fk_band_id 
and double_sample.fk_time_id = time.id and band.id in (11, 12))';
ij> next c1;
ij> close c1;
ij> exit;

and then look in derby.log, you should (hopefully) see the query plan.

NOTE: I removed the "count(*)" from the query so that we can fetch one row at a 
time.  If you do this, you should see the query plan in the log file after the 
first row has been retrieved.  Of course, whether or not this will help depends 
on how long it takes to retrieve the first row--but it's worth a try.
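In case it helps, the logQueryPlan property can be set either in the 
derby.properties file in your Derby system directory, or on the JVM command 
line when you start your application:

# in derby.properties (in the derby.system.home directory):
derby.language.logQueryPlan=true

or

java -Dderby.language.logQueryPlan=true <your application>

(The command-line form assumes the embedded driver is running in that same 
JVM.)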

> Is the entire inner query being re-evaluated for every candidate record in
> the outer query?

That's possible, yes.  DERBY-781 made some changes to materialize a subquery by 
use of a hash join, but a hash join requires an equijoin predicate.  For your 
query there is no explicit equijoin between "time" and the subquery, so I do not 
think DERBY-781 will help.  Therefore it is quite possible that the subquery is 
in fact being re-evaluated multiple times, as you say.
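One experiment that might be worth trying: since the subquery already joins 
against "time" on time.id, the outer IN looks redundant to me (though I have 
not verified that against your schema and data), and you could count from the 
subquery directly:

ij> select count(*) from
      (select distinct time.id from time, double_sample, band
       where band.id = double_sample.fk_band_id
       and double_sample.fk_time_id = time.id
       and band.id in (11, 12)) t;

If that form is fast, it would be further evidence that repeated re-evaluation 
of the subquery is what is hurting you.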

> I read somewhere that inner queries may not be materialized if they are 
> larger than a certain threshold.

When talking about hash joins, the estimated in-memory size of the inner 
table/query has to be less than derby.language.maxMemoryPerTable, where that 
property is specified in kilobytes and defaults to 1024 (i.e. default max mem 
per table is 1 meg).  You can try setting that property to a larger value 
(within the constraints of JVM heap), but I do not know if that will help.  It 
may make hash joins more feasible *within* the subquery, but I do not think it 
will allow materialization of the subquery itself since, as mentioned above, 
there is no explicit equijoin with the subquery.
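For reference, that property is set like the others, e.g. to raise the limit 
to 4 megabytes (4096 KB; pick something that fits comfortably in your heap):

# in derby.properties:
derby.language.maxMemoryPerTable=4096

or

java -Dderby.language.maxMemoryPerTable=4096 <your application>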

> If the inner result set is too large for memory, couldn't it be cached 
> to disk?

Yes, it can.  And if, for example, you set the derby.language.maxMemoryPerTable 
property to a value that is too large for memory, this "spill-over" to disk will 
in fact occur.  I think there is an improvement filed somewhere to make it so 
that instead of "aborting" a hash join when the inner result set is too large, 
the optimizer should just make the join more expensive to account for the 
spill-to-disk i/o.  But I don't think anyone has made that improvement yet...

So in the end, there's probably not much information in this email that will 
tell you how to make your query run more quickly.  As you can see, though, there 
is plenty of room for improvement in this particular area of Derby.  If this is 
something that is exciting and/or essential for you, you should definitely feel 
free to join the development community and to help address the problem.  We'd be 
thrilled to have you!

Army

