db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <de...@segel.com>
Subject RE: Strange performance issue with BLOB's
Date Wed, 12 Jul 2006 19:01:33 GMT
Ewww!

Ok, so that will kill your performance.

[More below]

> -----Original Message-----
> From: Mike Matrigali [mailto:mikem_app@sbcglobal.net]
> Sent: Wednesday, July 12, 2006 12:40 PM
> To: Derby Discussion
> Subject: Re: Strange performance issue with BLOB's
> 
> I think the following is what is going on. The blobs are not
> really instantiated in memory in one sense, but the entire
> container is read through the cache a page at a time.
> 
> The base container of a derby table is a sequence of pages.  Derby
> stores blob data in the same container as the rest of the data,
> in a page link chain.  There is a map of pages used only for
> space allocation, not page types.  So in the scan case Derby
> visits every page in the container and then checks whether each
> page is a "main" page or not.  Main pages are returned up the
> stack to access to process each row.  In the case of your
> query no blob data is actually accessed so "access" never requests
> the column data for the blob data - but the whole table is
> still read through the cache.  There is no support in the base
> container implementation to "skip" to the next main page - that
> is what indexes provide.
> 
[mjs] 
So its trying to load the pages(s) associated with the 27 blobs, even though
it doesn't have anything to do with the query?

Maybe a simple fix would be to segment the pages in the container. Store the
relational information in the pages at the "top" of the container, and a
reference to the first page of "Blob space". Then each row would then store
a reference to the blob pages that are relative to the start of the
Blobspace.

> This is an interesting case for the optimizer.  My guess is that
> the optimizer never considers using an index if there is no
> where clause unless the index "covers" (includes all the columns
> of the query).  I am not sure how hard it would be to get the
> optimizer to consider using the index in this case, and I am
> not sure if the existing costing handles this blob case - but
> from your experiment is sort of looks like the costing is right
> as the optimizer is choosing the index when you added the
> dummy where - I am sure the costing of that where indicated that
> it would scan the entire index.
> 
[mjs] Righ.
But what index would it use if there were no hints in the query itself?
So in this respect, I'd say that Derby was doing the right thing by doing a
sequential scan.

>From a performance perspective, many other databases will ignore indexes and
do sequential scans on small tables.... 

> With performance issues like this a lot is made much more obvious
> if you can include the query plan of the 2 queries.
> 
[SNIP]
[mjs] 
I wouldn't call this a bug per se, but a design issue.
I would suggest opening up a JIRA case to focus on the data storage rather
than the optimizer.

Clearly Derby's storage of Blobs is inefficient.

So, the key would be to make improvements on Derby's storage, that would
have minimal impact on existing implementations.

If we extend the container concept to be more intelligent about the data and
its pages in the "table", we can greatly improve performance.  By segmenting
the blobs away from the rest of the table data, we could increase the
efficiency of the table, with the only sacrifice being an extra jump to get
to the front of the blob. (Relatively speaking, this cost would be minimal
considering that we're talking about blobs that for the most part span
multiple pages.

But hey! What do I know? I'm just talking about theory. ;-)

-G




Mime
View raw message