db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeffrey Lichtman <swa...@rcn.com>
Subject Stack overflow from materialized subquery
Date Thu, 20 Oct 2005 19:10:32 GMT
I have figured out what's behind the stack overflow reported by 
Daniel Skiles in the derby-user list. To summarize, he gets a stack 
overflow from this query:

     SELECT path from filsystemfiles where
         path not in (select path from existingfiles)

The stack trace shows a large number of UnionResultSets calling 
openCore(). The odd thing, of course, is that the query doesn't 
contain any unions.

It turns out that, at some point, an optimization was added to 
subquery processing to materialize the subquery result in memory in 
certain cases. According to the comments, the implementor of this 
change was Tingjian Ge. The comments also refer to "beetle 4373", 
which I assume is a cute way of referring to a bug number.

The in-memory materialization works by reading the rows into memory 
and constructing a set of nested UnionResultSets on the fly. This 
happens in BaseActivation.java in the language execution code, in the 
method materializeResultSetIfPossible(). This code tries to ensure 
that the rows will fit in memory, but it doesn't take the possibility 
of a stack overflow as a result of nested unions into account.

The obvious solution is to get rid of the nested union technique in 
materializeResultSetIfPossible(), and to replace it with an iterative 
technique (as opposed to a recursive one). This should be pretty 
easy, as the method already puts the rows into a Vector before 
creating the unions.

I should say, however, that some aspects of this optimization bother 
me. The biggest problem is that it could make some queries run much 
slower. Consider the following:

     select * from one_row_table where column1 not in
         (select column2 from million_row_table)

The "unoptimized" query will read the single row from one_row_table 
and do a probe into million_row_table to see if it finds a match. If 
there is an index on million_row_table.column2, it will return a 
result very quickly despite the large size of million_row_table.

With the materialization "optimization" it will read all of 
million_row_table into memory, even if there's an index on column2. 
This is obviously a big performance hit.

Since subquery materialization of this type can be expensive, the 
decision as to whether to do it should be based on query optimizer 
cost estimates. The optimizer knows what the cost is of materializing 
a subquery, and also what the cost is of getting a row from an 
unmaterialized subquery.

I also wonder about the wisdom of having the language cache partial 
query results in memory. Although this can help performance in some 
cases, it also chews up memory. This is different from a limited-size 
cache with a backing store (like what the store uses for page 
caching). The language has no way to limit the total amount of memory 
used in this type of processing.

I also have some problems with the way this optimization was 
implemented. The decision to materialize the subquery results in 
memory is made during code generation - all such decisions should be 
made during the compilation phase. It's not clear to me why 
materializeResultSetIfPossible() is in BaseActivation - I would 
expect the of materialization to be done by a type of ResultSet, not 
by a method in BaseActivation. Also, this method calls 
getMaxMemoryPerTable() in the OptimizerFactory - nothing in the 
execution code should refer to anything in the compilation code 
(perhaps getMaxMemoryPerTable() should be moved somewhere else).

Thoughts, anyone?


                        -        Jeff Lichtman
                                 swazoo@rcn.com
                                 Check out Swazoo Koolak's Web Jukebox at
                                 http://swazoo.com/ 


Mime
View raw message