Mailing-List: contact derby-user-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Derby Discussion" <derby-user@db.apache.org>
Received-SPF: pass (athena.apache.org: local policy)
Subject: RE: Bad table design leads to  bad performance...RE: Bad
 performance with UPDATE on a nested SELECT
From: Mikkel Kamstrup Erlandsen <mke@statsbiblioteket.dk>
To: "msegel@segel.com" <msegel@segel.com>
CC: 'Derby Discussion' <derby-user@db.apache.org>
In-Reply-To: <20090113170216.69E3F6057E@dbrack01.segel.com>
References: <20090113170216.69E3F6057E@dbrack01.segel.com>
Content-Type: text/plain
Date: Tue, 13 Jan 2009 23:47:25 +0100
Message-ID: <1231886845.9425.25.camel@delight>
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

On Tue, 2009-01-13 at 16:59 +0100, Michael Segel wrote:
> See my comments mixed in...
> 
> Just a caveat, you're right, I don't know your entire application, I'm just
> looking at it from what little information you have shared in your posts....

I apologize if I am cryptic, there are no secrets at all[1], but we do
have a lot of odd requirements from the data model that I wont bore you
with if I can avoid it :-)

[1]: I am working on an open source project called Summa:
http://wiki.statsbiblioteket.dk/summa

More comments inlined below...

> > -----Original Message-----
> > From: Mikkel Kamstrup Erlandsen [mailto:mke@statsbiblioteket.dk]
> > Sent: Tuesday, January 13, 2009 8:51 AM
> > To: Derby Discussion; msegel@segel.com
> > Subject: Re: Bad table design leads to bad performance...RE: Bad
> > performance with UPDATE on a nested SELECT
> > 
> [SNIP]
> > 
> > It will not be trivial for us to "just" use integer ids. The problem is
> > that either end of a relation will not necessarily exist in the base.
> > 
> 
> In your update statement, as written, unless there is a parent to child
> relationship established, you will not update a base record. Your statement
> is matching the base record's id on a match to a parentId which has a
> specific child record.
> 
> This means that you *can* have a base record that will not get updated.

Yes, this is intentional. In the "real" query I actually update the
timestamps of all parent records of a given record. I just used 'base'
here instead of timestamps because it is simpler in writing (and has the
exact same execution pattern).

<SNIP>
> >   id: book_1
> >   children: book_2, book_3
> >   parents:
> > 
> > 7 days later we receive metadata for book_2:
> > 
> >   id: book_2
> >   children:
> >   parents:
<SNIP>
> Yes! That is exactly what I am suggesting. I wasn't sure but you're
> indicating that 'book_1', 'book_2' and 'book_3' are all the same data types
> where the role is going to be table dependent. So you can create a simple
> table that maps a book varchar to an identity value. (A unique integer value
> with a backing index) This implies that there can only be one id for
> 'book_1'. Also you would then want an index on the varchar only for this
> table. (One fat index instead of 3)
> 
> By doing this, you do not have any need for anything more than allowing
> nulls in the column. 

Ok, I've tried this. More on this below...

> Using your example above, you say that book_1 has children (book_2 and
> book_3) Does this not imply that for book_2 that book_1 is its parent?

Yes.

> So do you add a parent child relationship, or do you keep it null until
> there is a metadata record and if the metadata record indicates that there
> are no relationships, do you then remove any relationships that might have
> been set prior to the new record?

When we put book_1 in the db we simply add the tuples (book_1,book_2)
and (book_1,book_3) to the relations table.

> Yes. My point is that Derby isn't PostgresSQL and there are a lot of
> internal factors that could cause the problems you are seeing. How database
> engines handle varchars internally could be an issue when concerning
> performance. In terms of languages, C is very different from Java. There are
> things that you can do in C that you can't do in Java.
> 
> With respect to your problem, I believe that there is an issue with the
> query optimizer. By cleaning up your table design the optimizer may make a
> better selection. As a last resort, you then apply the hint.
> 
> What I am suggesting is that for testing purposes only, you create the table
> that maps a varchar to an identity integer. Then for each row in your base
> table, you insert a row in to a new table that stores the integer id instead
> of the varchar. You then do the same for the relationship table.
> 
> When you re-run the query you can either join the table in the inner select,
> or you could just hardcode the integer value instead of the varchar value.
> 
> Yes there is minimal cost for the join, but you're talking milliseconds.
> Weigh that against the time spent on a sequential scan of 1.5million +
> records...

As I said above, I just tried out your strategy. Using only integer
handles the query runs in about 4s. I still need a factor 100 better
than that...

Instead of using Derby properties we will probably resolve to hacking
around this in the code by doing the SELECT first and then iterating
manually over the result set doing one UPDATE per row... Before it gets
to that I'll give it a final stab tomorrow when I am rested.

Cheers,
Mikkel