db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Segel" <mse...@segel.com>
Subject RE: Bad table design leads to bad performance...RE: Bad performance with UPDATE on a nested SELECT
Date Tue, 13 Jan 2009 15:59:30 GMT

See my comments mixed in...

Just a caveat, you're right, I don't know your entire application, I'm just
looking at it from what little information you have shared in your posts....

> -----Original Message-----
> From: Mikkel Kamstrup Erlandsen [mailto:mke@statsbiblioteket.dk]
> Sent: Tuesday, January 13, 2009 8:51 AM
> To: Derby Discussion; msegel@segel.com
> Subject: Re: Bad table design leads to bad performance...RE: Bad
> performance with UPDATE on a nested SELECT
> It will not be trivial for us to "just" use integer ids. The problem is
> that either end of a relation will not necessarily exist in the base.

In your update statement, as written, unless there is a parent to child
relationship established, you will not update a base record. Your statement
is matching the base record's id on a match to a parentId which has a
specific child record.

This means that you *can* have a base record that will not get updated.

> This means that if we use proxy integer ids we'd have to insert a
> dummy-id in the relations table that me later must pair up to the real
> id if a record with linked id shows up.
This is not the case.
(See below)

> To be concrete; I am working at a library and the "records" are books
> linking each other. A case could be that we are instructed to put the
> following metadata in the base:
>   id: book_1
>   children: book_2, book_3
>   parents:
> 7 days later we receive metadata for book_2:
>   id: book_2
>   children:
>   parents:
> Note that book_2 does not mention its relation to book_1. This is
> (sadly) legal in our world.
> I can see that your strategy can work out if we store
> integer_id->varchar_id in a separate table. Ids for non-existing records
> can just be added here as well. I am just concerned that this will bring
> even more JOIN overhead than we already have...

Yes! That is exactly what I am suggesting. I wasn't sure but you're
indicating that 'book_1', 'book_2' and 'book_3' are all the same data types
where the role is going to be table dependent. So you can create a simple
table that maps a book varchar to an identity value. (A unique integer value
with a backing index) This implies that there can only be one id for
'book_1'. Also you would then want an index on the varchar only for this
table. (One fat index instead of 3)

By doing this, you do not have any need for anything more than allowing
nulls in the column. 

Using your example above, you say that book_1 has children (book_2 and
book_3) Does this not imply that for book_2 that book_1 is its parent?
So do you add a parent child relationship, or do you keep it null until
there is a metadata record and if the metadata record indicates that there
are no relationships, do you then remove any relationships that might have
been set prior to the new record?

> > Also do not use query optimization statements except as a last resort.
> >
> > Again I *highly* recommend creating a table where you assign a unique id
> to
> > a given 'parentId'/'id'/'childId' and then use that id in these tables
> > instead of the varchar. 1.5 million rows of an index on a varchar column
> > means a lot of wasted space and overhead. (Just to re-iterate the point
> made
> > above.)
> >
> > Look, no disrespect... But a lot of times, improving your table design
> will
> > improve your performance and at the same time keeping your solution
> database
> > agnostic.
> Your input is much appreciated and provides good food for thought :-)
> The problem is just that we have a PostgresQL doing this exact query (on
> the same data amounts) in < 1ms, but we would really, _really_ like to
> use Derby instead.
> Cheers,
> Mikkel

Yes. My point is that Derby isn't PostgresSQL and there are a lot of
internal factors that could cause the problems you are seeing. How database
engines handle varchars internally could be an issue when concerning
performance. In terms of languages, C is very different from Java. There are
things that you can do in C that you can't do in Java.

With respect to your problem, I believe that there is an issue with the
query optimizer. By cleaning up your table design the optimizer may make a
better selection. As a last resort, you then apply the hint.

What I am suggesting is that for testing purposes only, you create the table
that maps a varchar to an identity integer. Then for each row in your base
table, you insert a row in to a new table that stores the integer id instead
of the varchar. You then do the same for the relationship table.

When you re-run the query you can either join the table in the inner select,
or you could just hardcode the integer value instead of the varchar value.

Yes there is minimal cost for the join, but you're talking milliseconds.
Weigh that against the time spent on a sequential scan of 1.5million +


View raw message