db-derby-user mailing list archives

From "Jim Newsham" <jnews...@referentia.com>
Subject RE: slow subqueries
Date Sat, 11 Nov 2006 18:38:50 GMT

Hi everyone, and thanks Army for your detailed response.  My responses are
interspersed below...

> -----Original Message-----
> From: Army [mailto:qozinx@gmail.com]
> Sent: Friday, November 10, 2006 6:53 AM
> To: Derby Discussion
> Subject: Re: slow subqueries
> 
> Jim Newsham wrote:
> > Why do queries with subqueries perform so poorly?
> 
> This is one of the places where Derby optimization could definitely use
> some
> improvement.  There are several known issues with the optimization of
> subqueries
> that can potentially lead to sub-optimal plans, including unreasonably
> high cost
> estimates (DERBY-1905, DERBY-1260), lack of an effective pruning algorithm
> (DERBY-1907), and a possibly insufficient "timeout" mechanism (DERBY-
> 1906).
> There has been work in this area to hopefully improve the situation for
> 10.2
> (DERBY-805, DERBY-781, DERBY-1007), but as you yourself have noticed, we
> have a
> long way to go.
> 
> > The following query completes almost instantly.  I suppose Derby
> simplifies
> > the subquery internally to "band.id = 11":
> >
> > ij(SAMPLEBASE)> select count(*) from time where time.id in (select
> distinct
> > time.id from time, double_sample, band where band.id =
> > double_sample.fk_band_id and double_sample.fk_time_id = time.id and
> band.id
> > in (11));
> >
> > This query "does not complete":
> >
> > ij(SAMPLEBASE)> select count(*) from time where time.id in (select
> distinct
> > time.id from time, double_sample, band where band.id =
> > double_sample.fk_band_id and double_sample.fk_time_id = time.id and
> band.id
> > in (11, 12));
> 
> Using the DDL that you posted in an earlier email I ran these queries and
> they
> both completed in less than a second.  Note, though, that I didn't have
> any data
> in the tables :)  That may sound like useless information to you, but it
> tells
> me that the query compilation is at least completing in a reasonable
> amount of
> time, which is a first step.
> 
> See http://wiki.apache.org/db-derby/PerformanceDiagnosisTips
> 
> In your case do you know how much time is being spent in compilation vs
> execution?  The easiest way to tell this in ij is to prepare the query
> first,
> then execute it:
> 
> ij> elapsedtime on;
> ij> prepare ps1 as 'select count(*) from time where time.id in (select
> distinct
> time.id from time, double_sample, band where band.id =
> double_sample.fk_band_id
> and double_sample.fk_time_id = time.id and band.id in (11))';
> ELAPSED TIME = 2344 milliseconds
> ij> execute ps1;
> 1
> -----------
> 0
> 
> 1 row selected
> ELAPSED TIME = 60 milliseconds
> ij> prepare ps2 as 'select count(*) from time where time.id in (select
> distinct
> time.id from time, double_sample, band where band.id =
> double_sample.fk_band_id
> and double_sample.fk_time_id = time.id and band.id in (11, 12))';
> ELAPSED TIME = 301 milliseconds
> ij> execute ps2;
> 1
> -----------
> 0
> 
> 1 row selected
> ELAPSED TIME = 10 milliseconds
> 
> If you have a script or a Java program that loads data into your tables,
> and if
> you are willing and able to post that script/program to the list, then
> those who
> are following this discussion might be more easily able to reproduce the
> problem, which would help the investigation.  If you cannot or do not want
> to
> post
> such a script/program, then any description about the kind and amount of
> data in
> the relevant tables would be helpful, as well.  For example, what value do
> you
> expect "count(*)" to return in these queries?  How many rows does the
> subquery
> in the IN clause return?  How many rows are in the various tables?

Unfortunately I don't have a data load script that I can release at the
moment.  My biggest problem is that I'm working on something that needs to
have been done yesterday.  I'd be happy to see performance enhancements in
Derby which apply to the cases I am using, but they won't come soon enough.
I need to find an interim solution.  Though if I do have some time later
I'll try to put together a test schema, data set, and collection of queries
that I can release.

Both the inner query and the outer query may return a very large number of
records.  In my test data set it is approximately 20,000 rows each, though in
practice it can be much larger... perhaps millions or more.  This is why a
nested loop is not going to work here: 20,000 squared is 400 million
operations, let alone millions squared.  For a query with this profile, the
inner query should be executed only once.  If its result needs to be cached,
fine... but...

In my previous email I suggested the idea of Derby generating two internal
result sets (one for the outer, and one for the inner query) and then
scanning these two result sets in parallel to produce the external result
set.  I know I don't really understand the internals of Derby... does this
even make sense?  Is it feasible?  It seems like the most efficient solution
if both the inner and outer queries have a very large number of results.
For this to work, the internal result sets would need to be sorted on the
column that they're related on.  For the cases I'm working with, this would
always be a primary key column.
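The parallel scan I have in mind would look roughly like this in plain Java
(just a sketch, not Derby internals; it assumes both id streams arrive sorted
ascending and duplicate-free, e.g. from a primary key index scan, and the
class and method names are made up):

```java
import java.util.Iterator;
import java.util.List;

public class SortedIdMerge {

    // Count the ids present in both streams with a single parallel pass.
    // Both iterators must yield ascending, duplicate-free ids, so each
    // side is consumed exactly once regardless of the other side's size.
    static int countMatches(Iterator<Long> outer, Iterator<Long> inner) {
        int matches = 0;
        if (!outer.hasNext() || !inner.hasNext()) return 0;
        long o = outer.next();
        long i = inner.next();
        while (true) {
            if (o == i) {
                matches++;
                if (!outer.hasNext() || !inner.hasNext()) break;
                o = outer.next();
                i = inner.next();
            } else if (o < i) {
                // outer id has no partner yet; advance the outer side
                if (!outer.hasNext()) break;
                o = outer.next();
            } else {
                // inner id has no partner yet; advance the inner side
                if (!inner.hasNext()) break;
                i = inner.next();
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Long> outerIds = List.of(1L, 3L, 5L, 7L, 9L);
        List<Long> innerIds = List.of(3L, 4L, 7L, 10L);
        // ids 3 and 7 appear on both sides
        System.out.println(
            countMatches(outerIds.iterator(), innerIds.iterator())); // 2
    }
}
```

The point is that the cost is linear in the two result set sizes, not their
product, which is the whole difference at 20,000 (or millions of) rows.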

In fact, as a workaround to the performance problem I have, I'm very
seriously considering simulating my subqueries outside of Derby, by writing
a ResultSet implementation which merges the results of two other (Derby)
result sets iterated over in parallel as described above.  Integrating this
into my other code is going to be a pretty ugly hack, and obviously it would
be much more efficient if Derby could do this internally.  

Obviously, I'd like to hear about any flaws you see with this approach (both
within Derby and for my hack/workaround).

> 
> > I've been searching/reading around and trying to understand a little
> about
> > how Derby works and how to check the statement plan.  I've read through
> the
> > tuning document.  Unfortunately if a query doesn't complete, I can't get
> the
> > statement plan.
> 
> Actually, I think you only need to retrieve the first row in order to get
> the
> statement plan.  For example, if you set
> "derby.language.logQueryPlan=true" and
> then run the following:
> 
> ij> get cursor c1 as 'select * from time where time.id in (select distinct
> time.id from time, double_sample, band where band.id =
> double_sample.fk_band_id
> and double_sample.fk_time_id = time.id and band.id in (11, 12))';
> ij> next c1;
> ij> close c1;
> ij> exit;
> 
> then look at derby.log, you should (hopefully) see the query plan.
> 
> NOTE: I removed the "count(*)" from the query so that we can fetch one row
> at a
> time.  If you do this, you should see the query plan in the log file after
> the
> first row has been retrieved.  Of course, whether or not this will help
> depends
> on how long it takes to retrieve the first row--but it's worth a try.

Thanks for the tips... both on getting compilation time, and getting the
plan without running the full query.  Does the query plan really show up in
the derby log?  I was getting the plan in ij, which is not so convenient.  I
suppose this is because I configured logging using a database-wide property
instead of a global property.
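For the record, here are the two ways I know of to set the property (as I
understand it, the global form is the one that sends plans to derby.log):

```sql
-- Database-wide: stored in the database itself
CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY(
    'derby.language.logQueryPlan', 'true');

-- Global (system-wide): put this line in derby.properties in the Derby
-- system directory, or pass it as a JVM system property:
--   derby.language.logQueryPlan=true
```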

> 
> > Is the entire inner query being re-evaluated for every candidate record
> in
> > the outer query?
> 
> That's possible, yes.  DERBY-781 made some changes to materialize a
> subquery by
> use of a hash join, but a hash join requires an equijoin predicate.  For
> your
> query there is no explicit equijoin between "time" and the subquery, so I
> do not
> think DERBY-781 will help.  Therefore it is quite possible that the
> subquery is
> in fact being re-evaluated multiple times, as you say.

To turn my query into an equijoin subquery, would I just change "where x in
(...subquery...)" to "where x = (...subquery...)"?  I believe I could do
that above, if I change the outer query to return distinct results.

But if I understand what you wrote below correctly, this is a required
condition but not a guarantee that a hash join will be used.  If the above
transformation makes sense (including for performance) and there is a way to
guarantee hash join materialization including caching to disk, I would love
to hear about it.
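For example, the rewrite I have in mind would look roughly like this
(untested sketch; it assumes every double_sample.fk_time_id references an
existing time row, so the inner reference to the time table can be dropped,
and it gives the optimizer an explicit equijoin predicate t.id = m.id to
consider):

```sql
SELECT COUNT(*)
FROM time t,
     (SELECT DISTINCT ds.fk_time_id AS id
      FROM double_sample ds, band b
      WHERE b.id = ds.fk_band_id
        AND b.id IN (11, 12)) m
WHERE t.id = m.id;
```

Whether Derby would actually choose a hash join for this form is exactly
what I am unsure about.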

> 
> > I read somewhere that inner queries may not be materialized if they are
> > larger than a certain threshold.
> 
> When talking about hash joins, the estimated in-memory size of the inner
> table/query has to be less than derby.language.maxMemoryPerTable, where
> that
> property is specified in kilobytes and defaults to 1024 (i.e. default max
> mem
> per table is 1 meg).  You can try setting that property to a larger value
> (within the constraints of JVM heap), but I do not know if that will help.
> It
> may make hash joins more feasible *within* the subquery, but I do not
> think it
> will allow materialization of the subquery itself since, as mentioned
> above,
> there is no explicit equijoin with the subquery.
> 
> > If the inner result set is too large for memory, couldn't it be cached
> > to disk?
> 
> Yes, it can.  And if, for example, you set the
> derby.language.maxMemoryPerTable
> property to a value that is too large for memory, this "spill-over" to
> disk will
> in fact occur.  I think there is an improvement filed somewhere to make it
> so
> that instead of "aborting" a hash join when the inner result set is too
> large,
> the optimizer should just make the join more expensive to account for the
> spill-to-disk i/o.  But I don't think anyone has made that improvement
> yet...
> 
> So in the end, there's probably not much information in this email that
> will
> tell you how to make your query run more quickly.  As you can see, though,
> there
> is plenty of room for improvement in this particular area of Derby.  If
> this is
> something that is exciting and/or essential for you, you should definitely
> feel
> free to join the development community and to help address the problem.
> We'd be
> thrilled to have you!
> 
> Army

I appreciate your feedback, and there was plenty of useful information in
your response.  I really wish I had the time to dig into Derby's source code
to see what it's doing, let alone spend time contributing improvements.
Unfortunately I'm working on borrowed time and have absolutely no time to
deviate.

In the end, we'd like to stick with Derby if we can, because it is pure Java
and embeddable, which is a big plus for deployment within a desktop
application such as ours.  I'm hoping that I can come up with a feasible
workaround for the subquery performance issue we have, and hope that in the
near future Derby can make my hack obsolete.

Thanks,
Jim Newsham




