db-derby-user mailing list archives

From: Raymond Kroeker <raykroe...@gmail.com>
Subject: Re: Indexing speed in presence of BLOB columns
Date: Sun, 16 Jan 2011 20:55:34 GMT
Hi Daniel,

I've noticed this as well and adjusted my schema patterns to 
compensate; I haven't seen the issue since.  For example, if I were to 
create tables around a file, I'd do the following:

create table FILE(
     ID integer not null primary key,
     PATH varchar(256) not null
);
create table FILE_CONTENT(
     FILE_ID integer not null references FILE(ID),
     CONTENT blob not null
);
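
With this split, queries that need only metadata never touch the blob 
pages.  A typical lookup might look like this (illustrative only; the 
path value is made up):

-- fetch one file's content via an index-friendly probe of FILE
select C.CONTENT
from FILE F join FILE_CONTENT C on C.FILE_ID = F.ID
where F.PATH = '/some/path';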

Basically, the pattern revolves around isolating all binary content in 
its own table.  If you do not, and you later add a NAME column to FILE, 
you will run into the same issue when you create an index on NAME.
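
With the split schema, that later change stays cheap, because building 
the index scans only the narrow FILE table.  A sketch, assuming the 
tables above (the index name is arbitrary):

alter table FILE add column NAME varchar(256);
create index FILE_NAME_IDX on FILE (NAME);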

You will also see a performance bottleneck simply by iterating over a 
FILE_CONTENT result set, even without reading the blob column.
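
For example, of these two illustrative counts against the schema 
above, the first scans only small metadata pages, while the second 
must still walk every page of the blob-bearing table:

-- cheap: FILE holds no blob data
select count(*) from FILE;

-- expensive: the scan reads past the interleaved blob pages even
-- though CONTENT itself is never fetched
select count(*) from FILE_CONTENT;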

If you have existing data, you will not be able to move forward 
without at least one performance hit.  What I'd suggest is to split 
the binary data into two tables, analogous to what I've done above, 
and then create the indices you need.
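
As a rough sketch against Daniel's BINARY table quoted below (the new 
table names are hypothetical, and the copy pays the full-scan cost 
once):

create table BINARY_META(
     ID integer not null primary key
);
create table BINARY_CONTENT(
     BINARY_ID integer not null references BINARY_META(ID),
     CONTENT blob(1G) not null
);
insert into BINARY_META (ID) select ID from BINARY;
insert into BINARY_CONTENT (BINARY_ID, CONTENT)
     select ID, BINARY from BINARY;
drop table BINARY;

The primary key already backs ID with an index, and any further 
indexes built on BINARY_META now scan only its small pages.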

Raymond

On 11-01-10 11:11 AM, Mike Matrigali wrote:
> Trejkaz wrote:
>> Hi all.
>>
>> I have been doing some performance testing of the time it takes to
>> create an index on a table with BLOB data.  The schema:
>>
>>     CREATE TABLE binary (
>>         id INTEGER NOT NULL,
>>         binary BLOB(1G) NOT NULL
>>     )
>>
>> Each time, 10,000 rows are inserted with data of a given size and then
>> an index is created at the end:
>>
>>     CREATE INDEX binary_id ON binary (id)
>>
>> Times are an average of three runs after some warmup runs which are 
>> not counted.
>>
>> 1 kB blobs:
>>   Insert: 78 ms
>>   Index: 567 ms
>>
>> 10 kB blobs:
>>   Insert: 3954 ms
>>   Index: 515 ms
>>
>> 100 kB blobs:
>>   Insert: 56307 ms
>>   Index: 20591 ms
>>
>> 1 MB blobs:
>>   Insert: 521904 ms
>>   Index: 122527 ms
>>
>> Surprisingly, the larger the data present in the BLOB column, the
>> longer the index takes to create.  Since the indexing process shouldn't
>> need to read the BLOB data in order to index the row, I am surprised
>> that it slows down when more data is present.  What is going on here
>> exactly?  Is Derby physically copying the BLOB data to a new location
>> and then deleting the original copy instead of performing some kind of
>> cheap move operation?  Is there some way to avoid this (which won't
>> require changing the schema)?
>>
>> Daniel
>>
> This is the expected behavior.  Derby base tables are very "basic", 
> hence the need for indexes.  In order to scan the entire base table 
> to build the index, every page in the base table needs to be read 
> from disk.  Even though we don't actually "read" the blob data when 
> creating the index, the system still needs to bring each page from 
> disk into memory to see whether it has any rows.  Basically, the 
> scan of the base table loops from page 1 to the last page, checks 
> whether each page is a head page with rows, and reads the rows.  The 
> blob data is stored in the same file as the main pages.
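
One way to see that page growth directly (a sketch, assuming Derby 
10.5 or later, where the SYSCS_DIAG.SPACE_TABLE diagnostic table 
function is available) is to compare the allocated page count before 
and after loading the blobs:

select CONGLOMERATENAME, NUMALLOCATEDPAGES
from table (SYSCS_DIAG.SPACE_TABLE('APP', 'BINARY')) as t;

NUMALLOCATEDPAGES grows with the blob size, and that count is roughly 
the number of pages the index build has to sweep.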

