db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <de...@segel.com>
Subject RE: Need Advice on Indexes
Date Wed, 24 Dec 2008 17:57:08 GMT
Actually its not all that hard. :-)


<soap box> Too many people are acting like 'physical' DBAs without
understanding how a relational database or index works. Are there any good
online tutorials which will 'teach' the basics? </soap box>

<caveat> What I am writing is a generic statement or guideline to
table/index design in general. It is not specific to derby/cloudscape/javadb
and should be taken with a grain of salt. I am sure that there will be
someone who disagrees and will tell you something different. As always YMMV.
</caveat>

Just on the surface a couple of things ...

First, even without knowing your application or the rest of your database, I
will say that you first need to normalize your data. Performance wise, this
is pretty important.

You're mixing and matching project information and job information in to one
table. Also, just a hint. Its more efficient to store a user_id than the
entire user name. (At most, a 4 byte word vs a varchar of 1-4 byte
characters per character. [Unicode UTF-8 character set for
internationalization]) Even considering ASCII, 'John Smith' on every row
takes up more space than employee_id 112. And since you're tracking both the
submit user id and run user id, that's twice the savings per row. ;-)
So you'll want to create look up tables and create more of a star schema
than just your flat table. 

Second, your index is almost as large as your table. You're creating an
index on what looks like 10 of the 16 columns? (I probably miscounted.) 
This isn't a good idea. You'll see this sort of thing done because some
engines (maybe all) will only search the index if all of the columns of the
query are also in the index, so the idea is that you'll save yourself some
seek times.

Just a rule of thumb, in relational databases you really don't want to make
it a habit of joining more than 3 or 4 tables together. With respect to
indexes, when you get beyond 3 or 4 columns you're not going to get much
more unique values.  The key is to use the index to limit your search for
values, however once you get down to a reasonable subset, the database will
be doing a sequential scan of your subset. In fact another rule of thumb is
that if your table has less than 10,000 rows, most engines do not use an
index.

Another issue. Having large indexes will kill your performance on inserts
and updates. (You add/change a row in the table, your index has to be
changed as well.) So unless you have a situation where you have lots of
reads and little updates/inserts, you're going to have performance issues.

Also, in general, when you do a query on a table, most databases will only
use one index. (Some databases like XPS or other data warehouse centric
databases will use multiple indexes on the data.)
The other big issue is that if your query is sorting on the 3rd or 4th
column of the index, most likely the optimizer will not use the index.
Using your index, that means if you have a statement like:

SELECT *
FROM   job
WHERE  submit_user = "JOHN DOE"

The database will not use the index but will do a sequential table scan.

In looking at your example, its easy to see that you could split this in to
a couple of different indexes.

As Rick points out, without knowing your queries, its hard to say what
indexes you'll need.

First, split the table to normalize the data.
Figure out the indexes needed for joining the table and also how you're
going to query for a job.

In addition... if you're really looking for the best performance, think
about how often you'll actually look at a column in the join. You have a lot
of varchar columns. Are they going to be long enough? What's the minimum
you're going to be storing in the actual table (min length usually gets
stored in the table while the rest is referenced from a different
tablespace/page). If this table is constantly being used, but you're not
actually looking at the remarks or comments (large varchar columns), you'll
want to think about putting them in a separate table. Or even converting
them to a text blob so that you don't have an artificial limitation.
(Assuming that you'll want or need more that 512 characters in a comment.)
Table width is still a factor, albeit, with each generation of advancements
in hard drive, memory, I/O, this becomes less and less of an issue.

The separating out the varchar is really, really necessary if you're doing a
lot of look ups on the non varchar columns and need to squeeze the most out
of your performance. 

Does this help?

BTW, I think its great that you're thinking about compound indexes. (indexes
using multiple columns.) Too often developers only think about a single
column or the framework only creates a single column index for you. Then you
end up with a lot of indexes and when you create your own indexes that are
compound indexes, you'll need to add optimizer hints in your query
statements. (If your database supports optimizer hints.)

-Mike

PS. Sorry if this seems a bit disjointed. I'm trying to write this as my
wife is bossing me around the house. :-)

Happy Holidays!


> -----Original Message-----
> From: Richard.Hillegas@Sun.COM [mailto:Richard.Hillegas@Sun.COM]
> Sent: Wednesday, December 24, 2008 7:50 AM
> To: Derby Discussion
> Subject: Re: Need Advice on Indexes
> 
> Hi Sai,
> 
> It's hard to answer your question without seeing your queries. However,
> for an index to be useful, some subset of the restricted columns in your
> query must form a leading prefix of the indexed columns. So for
> instance, your index would not be useful for the following query:
> 
>   select * from job where proj_name  = 'Plan 9'
> 
> because the leading column of the index, proj_id, is not restricted by
> the query. If the timestamp ranges are always necessary in your queries,
> then I would make them the leading columns in the index.
> 
> Hope this helps,
> -Rick
> 
> Sai Pullabhotla wrote:
> > Dear Developers,
> >
> > I currently have the following table and index in a Derby database.
> >
> > create table job(
> > 	job_id bigint not null,
> > 	proj_id integer not null,
> > 	proj_name varchar(50) not null,
> > 	proj_folder varchar(256) not null,
> > 	submit_user varchar(20) not null,
> > 	run_user varchar(20) not null,
> > 	queue_priority integer not null,
> > 	run_priority integer not null,
> > 	submit_time timestamp not null,
> > 	start_time timestamp not null,
> > 	end_time timestamp,
> > 	time_taken bigint,
> > 	job_log varchar(256) not null,
> > 	status char(1) default 'A' not null,
> > 	remarks varchar(512),
> > 	sched_job_id integer,
> >
> > 	primary key(job_id),
> > 	check(status in('A', 'S', 'F', 'C'))
> > );
> >
> > create index idx_job on job (proj_id desc, proj_name desc, proj_folder
> > desc, start_time desc, end_time desc, run_user desc, time_taken desc,
> > status desc, sched_job_id desc);
> >
> > This table would typically have a few hundred thousand records
> > (~500,000). Records should be searchable by a combination of date
> > range (start_time), project name, project folder, run user, status
> > (one or more status codes using IN clause). To further clarify this
> > search criteria, date range is always required and the rest of the
> > parameters are optional.
> >
> > There is also a need to search records by job_id alone as well as
> > sched_job_id alone. In the first case, there will be at most one
> > result, but in the second case there could be several.
> >
> > The search results should be sortable pretty much by any column except
> > remarks and job log.
> >
> > We also have a need to display the job count based on the status code
> > for today, yesterday, last week and last month. In other words, the
> > last week statistics would show how many jobs were successful, how
> > many were failed etc. in the past one week.
> >
> > I'm wondering if the above index I've is good enough to perform all
> > these functions fast enough. Please advise.
> >
> > Regards,
> >
> > Sai Pullabhotla
> > Phone: (402) 408-5753
> > Fax: (402) 408-6861
> > www.jMethods.com
> >




Mime
View raw message