db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mikkel Kamstrup Erlandsen <...@statsbiblioteket.dk>
Subject Re: Bad table design leads to bad performance...RE: Bad performance with UPDATE on a nested SELECT
Date Tue, 13 Jan 2009 14:50:51 GMT
On Tue, 2009-01-13 at 15:19 +0100, Michael Segel wrote:
> 
> > -----Original Message-----
> > From: Knut.Hatlen@Sun.COM [mailto:Knut.Hatlen@Sun.COM]
> > Sent: Tuesday, January 13, 2009 4:59 AM
> > To: Derby Discussion
> > Subject: Re: Bad performance with UPDATE on a nested SELECT
> [SNIP]
> > Yes, it looks like the same query plan. I see the following problems
> > with this query plan:
> > 
> > 1) The IN clause is rewritten to a join with SUMMA_RECORDS as the outer
> > table, and we have no information to limit the scan of the outer table,
> > hence a full table scan is performed. It would probably have been better
> > if SUMMA_RECORDS were the inner table. Then the C index could be used to
> > enforce the restriction (childId='...') on the outer table,
> > SUMMA_RELATIONS, and index I could be used to perform lookups in the
> > inner table.
> > 
> > 2) The index scan on SUMMA_RELATIONS uses the index PC. Since that index
> > is on the columns (parentId, childId), it is primarily sorted on
> > parentId, so the scan needs to go through the entire index, for each row
> > in the outer table, in order to match childId with the criterion in the
> > WHERE clause of the nested select. It would probably be better to use
> > the index on childId (index C) instead of PC (or reverse the order of
> > the columns in the PC index).
> > 
> > Derby allows you to override some of the optimizer's decisions, but it's
> > not as easy for UPDATE statements as it is for SELECT statements, so I
> > don't know how to tell it not to pick the table scan. Using the C index
> > instead of the PC index should be easy enough, though:
> > 
> > UPDATE summa_records SET base='my_base' WHERE id IN
> >   (SELECT parentId FROM summa_relations --DERBY-PROPERTIES index=C
> >       WHERE childId='horizon_2332668')
> > 
> > --
> > Knut Anders
> 
> I think that you're missing something.
> 
> First without knowing the whole application, why is there an index PC
> instead of an Index P and then an index C? The relationship match table has
> just two columns. So why do you need a compound index? It doesn't make
> sense.

The idea was to have a unique index on the 'relations' table, but maybe
you are right...

> Will a parent have over 10,000 children? Then maybe it might make sense.
> (Well not really. If that was the case, you'd probably want to use a temp
> table. Assuming that Derby support the dynamic creation of temp tables like
> Informix, but not like Oracle or DB2... )

A typical case is that parents might have 0-10 children (and reverse -
children might have 0-10 parents).

> I suggested that you redesign your tables so that instead of working with
> varchars and indexing varchars all over the place, you instead create a
> table so you can do a lookup of the varchar, matching to an integer ID. 
> 
> Indexing varchars in a main table is a *bad* idea. You end up with a lot of
> fat indexes. Fat indexs don't perform as well as skinny indexes, especially
> when you're talking about tables with millions plus rows.

It will not be trivial for us to "just" use integer ids. The problem is
that either end of a relation will not necessarily exist in the base. 

This means that if we use proxy integer ids we'd have to insert a
dummy-id in the relations table that me later must pair up to the real
id if a record with linked id shows up.

To be concrete; I am working at a library and the "records" are books
linking each other. A case could be that we are instructed to put the
following metadata in the base:

  id: book_1
  children: book_2, book_3
  parents:

7 days later we receive metadata for book_2:

  id: book_2
  children:
  parents:

Note that book_2 does not mention its relation to book_1. This is
(sadly) legal in our world.

I can see that your strategy can work out if we store
integer_id->varchar_id in a separate table. Ids for non-existing records
can just be added here as well. I am just concerned that this will bring
even more JOIN overhead than we already have...

> Also do not use query optimization statements except as a last resort. 
> 
> Again I *highly* recommend creating a table where you assign a unique id to
> a given 'parentId'/'id'/'childId' and then use that id in these tables
> instead of the varchar. 1.5 million rows of an index on a varchar column
> means a lot of wasted space and overhead. (Just to re-iterate the point made
> above.)
> 
> Look, no disrespect... But a lot of times, improving your table design will
> improve your performance and at the same time keeping your solution database
> agnostic.

Your input is much appreciated and provides good food for thought :-)
The problem is just that we have a PostgresQL doing this exact query (on
the same data amounts) in < 1ms, but we would really, _really_ like to
use Derby instead.

Cheers,
Mikkel


Mime
View raw message