Return-Path: Delivered-To: apmail-db-derby-user-archive@www.apache.org Received: (qmail 29772 invoked from network); 13 Jan 2009 22:47:59 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 13 Jan 2009 22:47:59 -0000 Received: (qmail 3690 invoked by uid 500); 13 Jan 2009 22:47:57 -0000 Delivered-To: apmail-db-derby-user-archive@db.apache.org Received: (qmail 3668 invoked by uid 500); 13 Jan 2009 22:47:57 -0000 Mailing-List: contact derby-user-help@db.apache.org; run by ezmlm Precedence: bulk list-help: list-unsubscribe: List-Post: List-Id: Reply-To: "Derby Discussion" Delivered-To: mailing list derby-user@db.apache.org Received: (qmail 3659 invoked by uid 99); 13 Jan 2009 22:47:57 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Jan 2009 14:47:57 -0800 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: local policy) Received: from [130.225.24.68] (HELO sbexch03.sb.statsbiblioteket.dk) (130.225.24.68) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Jan 2009 22:47:48 +0000 Received: from [127.0.0.1] (130.225.25.246) by sbexch03.sb.statsbiblioteket.dk (130.225.24.68) with Microsoft SMTP Server id 8.1.336.0; Tue, 13 Jan 2009 23:47:26 +0100 Subject: RE: Bad table design leads to bad performance...RE: Bad performance with UPDATE on a nested SELECT From: Mikkel Kamstrup Erlandsen To: "msegel@segel.com" CC: 'Derby Discussion' In-Reply-To: <20090113170216.69E3F6057E@dbrack01.segel.com> References: <20090113170216.69E3F6057E@dbrack01.segel.com> Content-Type: text/plain Date: Tue, 13 Jan 2009 23:47:25 +0100 Message-ID: <1231886845.9425.25.camel@delight> MIME-Version: 1.0 X-Mailer: Evolution 2.24.2 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org On Tue, 2009-01-13 at 16:59 +0100, Michael Segel wrote: > See my comments mixed in... > > Just a caveat, you're right, I don't know your entire application, I'm just > looking at it from what little information you have shared in your posts.... I apologize if I am cryptic, there are no secrets at all[1], but we do have a lot of odd requirements from the data model that I wont bore you with if I can avoid it :-) [1]: I am working on an open source project called Summa: http://wiki.statsbiblioteket.dk/summa More comments inlined below... > > -----Original Message----- > > From: Mikkel Kamstrup Erlandsen [mailto:mke@statsbiblioteket.dk] > > Sent: Tuesday, January 13, 2009 8:51 AM > > To: Derby Discussion; msegel@segel.com > > Subject: Re: Bad table design leads to bad performance...RE: Bad > > performance with UPDATE on a nested SELECT > > > [SNIP] > > > > It will not be trivial for us to "just" use integer ids. The problem is > > that either end of a relation will not necessarily exist in the base. > > > > In your update statement, as written, unless there is a parent to child > relationship established, you will not update a base record. Your statement > is matching the base record's id on a match to a parentId which has a > specific child record. > > This means that you *can* have a base record that will not get updated. Yes, this is intentional. In the "real" query I actually update the timestamps of all parent records of a given record. I just used 'base' here instead of timestamps because it is simpler in writing (and has the exact same execution pattern). > > id: book_1 > > children: book_2, book_3 > > parents: > > > > 7 days later we receive metadata for book_2: > > > > id: book_2 > > children: > > parents: > Yes! That is exactly what I am suggesting. I wasn't sure but you're > indicating that 'book_1', 'book_2' and 'book_3' are all the same data types > where the role is going to be table dependent. So you can create a simple > table that maps a book varchar to an identity value. (A unique integer value > with a backing index) This implies that there can only be one id for > 'book_1'. Also you would then want an index on the varchar only for this > table. (One fat index instead of 3) > > By doing this, you do not have any need for anything more than allowing > nulls in the column. Ok, I've tried this. More on this below... > Using your example above, you say that book_1 has children (book_2 and > book_3) Does this not imply that for book_2 that book_1 is its parent? Yes. > So do you add a parent child relationship, or do you keep it null until > there is a metadata record and if the metadata record indicates that there > are no relationships, do you then remove any relationships that might have > been set prior to the new record? When we put book_1 in the db we simply add the tuples (book_1,book_2) and (book_1,book_3) to the relations table. > Yes. My point is that Derby isn't PostgresSQL and there are a lot of > internal factors that could cause the problems you are seeing. How database > engines handle varchars internally could be an issue when concerning > performance. In terms of languages, C is very different from Java. There are > things that you can do in C that you can't do in Java. > > With respect to your problem, I believe that there is an issue with the > query optimizer. By cleaning up your table design the optimizer may make a > better selection. As a last resort, you then apply the hint. > > What I am suggesting is that for testing purposes only, you create the table > that maps a varchar to an identity integer. Then for each row in your base > table, you insert a row in to a new table that stores the integer id instead > of the varchar. You then do the same for the relationship table. > > When you re-run the query you can either join the table in the inner select, > or you could just hardcode the integer value instead of the varchar value. > > Yes there is minimal cost for the join, but you're talking milliseconds. > Weigh that against the time spent on a sequential scan of 1.5million + > records... As I said above, I just tried out your strategy. Using only integer handles the query runs in about 4s. I still need a factor 100 better than that... Instead of using Derby properties we will probably resolve to hacking around this in the code by doing the SELECT first and then iterating manually over the result set doing one UPDATE per row... Before it gets to that I'll give it a final stab tomorrow when I am rested. Cheers, Mikkel