Mailing-List: contact derby-user-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Derby Discussion" <derby-user@db.apache.org>
Received-SPF: pass (nike.apache.org: local policy)
Subject: RE: Bad table design leads to  bad performance...RE: Bad
 performance with UPDATE on a nested SELECT
From: Mikkel Kamstrup Erlandsen <mke@statsbiblioteket.dk>
To: "msegel@segel.com" <msegel@segel.com>
CC: 'Derby Discussion' <derby-user@db.apache.org>
In-Reply-To: <20090114141747.B25FB605EF@dbrack01.segel.com>
References: <20090114141747.B25FB605EF@dbrack01.segel.com>
Content-Type: text/plain
Date: Thu, 15 Jan 2009 09:41:12 +0100
Message-ID: <1232008872.6328.33.camel@pc280>
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

On Wed, 2009-01-14 at 14:15 +0100, derby@segel.com wrote:
> 
> > -----Original Message-----
> > From: Mikkel Kamstrup Erlandsen [mailto:mke@statsbiblioteket.dk]
> > Sent: Wednesday, January 14, 2009 12:12 AM
> > To: Derby Discussion
> > Cc: msegel@segel.com
> > Subject: RE: Bad table design leads to bad performance...RE: Bad
> > performance with UPDATE on a nested SELECT
> [SNIP]
> > >
> > > As I said above, I just tried out your strategy. Using only integer
> > > handles the query runs in about 4s. I still need a factor 100 better
> > > than that...
> > 
> > Sorry, forgot to add that this was on a base with ~700k rows only, not
> > the full 1,5M rows one...
> > 
> > Cheers,
> > Mikkel
> 
> Mikkel,
> 
> Sorry if I'm being a bit slow...
> 
> That 4 seconds was on 700K rows using integers instead of varchars?
> Does that also include the optimizer hint or was that without the optimizer
> hint?

Yes. And it was without the optimizer hint.

> The reason I ask is that if your query was the same and you just switched
> from varchar to integers, then there clearly is an issue with varchars and
> the optimizer. 

Yeah, the query was the exact same, just with INTEGER instead of
VARCHAR(255) for the id column. For reference, the query we are talking
about:

UPDATE records
SET base='my_base'
WHERE id IN (
  SELECT parentId
  FROM relations
  WHERE childId='id_1');

I am not sure that the issue is about varchars. As I stated in an
earlier mail I had a running time of 5-10s on the 700k row base when I
was using varchars. This means that an identity index on an integer
column gives a small speed up (maybe 30% if I am to pull a random number
out of my hat). What I was hoping for was more like 3000%...

> If I understood you correctly, it sounded like when you tried the optimizer
> hint, using varchars, you got the same result as the integers. Is that the
> case?

More or less... The integer case might be slightly faster, but I don't
have precise timings to back this up. In any case the difference here is
very small.

Generally I would say that the optimizer hint doesn't provide any real
difference.

> If you're going to loop through the result set, you may find it faster to
> delete the old row and then perform the insert. Updates are hard(er) on a
> database.

I am just doing a manual loop over the SELECT result set now (it usually
contains 0-5 rows) and doing an update for each row. It works at an
acceptable speed. < 1ms for each op.

> Derby isn't Postgres and there are some neat things you can do in C that you
> can't do in Java that could give a C based database an edge. (Pointers can
> be your friend. ;-) At the same time, there are design decisions that could
> have been made which are hampering performance now. 

And C doesn't have a JIT compiler :-) Java programs can be blazingly
fast. Lucene impresses me on a daily basis :-)

I don't believe this issue is about Java. At this point I am quite
certain that it is a problem with the query optimizer in Derby.

Cheers,
Mikkel