db-derby-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <de...@segel.com>
Subject RE: Need Advice on Indexes
Date Fri, 26 Dec 2008 00:32:31 GMT

Sigh.


Sai,

You post a question to this mailing list looking for help. 
The help you are getting is free so take it with a grain of salt.
But also consider that the people who are taking the time to help you are
seasoned database professionals who have a lot of expertise.

You are trying to rationalize a bad design. We know that it is a bad design
because you're here asking for help because your application does not
perform well.

I would suggest that you consider what is being said and then taking it back
to your team so you can fix your design. 

Not having indexes is bad. Fixing this doesn't fix your poor design. It will
improve some performance issues, but you'll not get the most from your
database.

I understand that you're trying to store your log data in to a database so
that you can take advantage of what a database does best.

But if you're going to take the time and effort to do this, why don't you do
it right from the start and do the little extra work to make your life
simpler in the long run.

There's an expression... '... there is more than one way to skin a cat ...'.
This is true when it comes to solving your problems in software.

If you add a column to your user table a emp_status char(1), you can store
values like 'A' for active, or 'N' for non active. Or if your database
supports binary data types, True or False for active or non-active.

You can then have your program get the correct user id based on username and
emp_status. You then use this information when you insert the record in to
your job table. This will 'shrink' the width of the columns. (I'm not going
to suggest alternatives like triggers and stored procedures but they also
work well...)

You can then do the same for your project table.

This should clean up your table design.

Getting in to your questions about indexes...

First, lets take a look at your query:
> select  job_id, proj_id, proj_name, proj_folder, submit_user, 
> run_user, queue_priority, run_priority, submit_time, start_time, 
> end_time, time_taken, job_log, status from dpa_job where start_time>=?
> and start_time<? and status in(?, ?, ?, ?) order by start_time desc
> 

I rewrote it in the following format.

SELECT  job_id, proj_id, proj_name, proj_folder, 
        submit_user, run_user, queue_priority, 
        run_priority, submit_time, start_time, end_time, 
        time_taken, job_log, status 
FROM    dpa_job 
WHERE   start_time>=?
AND     start_time<? 
AND     status IN(?, ?, ?, ?) 
ORDER BY start_time desc

The reason you want to do this is that you now have visual cues to look at
your query and see what is going on.  It doesn't take much more time and if
you make this a habit, you'll get even faster at spotting problems.

The SELECT clause shows you the columns that you're displaying.
The FROM clause shows you where the data is coming from.
The WHERE clause indicate what filters are being used to limit the data. 
(Its really the WHERE, AND, and OR statements.)
This is where you need to focus your attention when you want to create
indexes that your table will use.

In this query, you are filtering on only two columns. So if you want to
improve the query's performance, you want an index that is ordered on one or
both of these columns. Since start_time is more unique than status, you'll
want the index to order first on start_time, and then on status. You may
want to not use status in the compound index unless you really have a lot of
different status types. (It looks like you only have a handful of status
types.

If you have other queries, you'd want to perform the same exercise and see
how the queries filter the datasets from each table and then see if you
already have an index that could be used. If not, then you'd want to add an
index.

An example... 
Suppose you want to find all of the jobs submitted for a given project by a
given user.

Your WHERE clause would look like:
WHERE  proj_id = ?
AND    submit_user = ?

If you want to limit the time stamp range you'd add the following:
AND    start_time >=?
AND    start_time <=?

You could then index on proj_id, submit_user, and possibly start_time in
that order.

(Since you may have an index on start_time, you may want to check to see
which index the optimizer is going to use. It is possible that the optimizer
uses the index on start_time and then filters the data set by the proj_id
and submit_user. This is going to be database dependent.)

Since you have a database and data, you can use ij to create and drop
indexes and test the queries to see how they perform. 

I would suggest that you look at a good relational database design book to
see how to improve performance and how relational databases work.

Way back when, I used C J Date's book in school.
The text I used is way out of print, but here's a list of CJ Date's books
that can be found on Amazon:
http://www.amazon.com/s/qid=1230251379/ref=sr_pg_2?ie=UTF8&rs=1000&sort=date
rank&keywords=Relational%20Database&rh=i%3Astripbooks%2Ck%3ARelational%20Dat
abase%2Cp_27%3AC.%20J.%20Date&page=1

This may be a good start:
http://www.amazon.com/Database-Depth-Relational-Theory-Practitioners/dp/0596
100124/ref=sr_1_6?ie=UTF8&s=books&qid=1230251480&sr=1-6

HTH

-Mike

> -----Original Message-----
> From: Sai Pullabhotla [mailto:sai.pullabhotla@jmethods.com]
> Sent: Thursday, December 25, 2008 11:06 AM
> To: Derby Discussion
> Subject: Re: Need Advice on Indexes
> 
> Mike,
> 
> We went through all of these during the design phase and decided that
> we do not have to do the kind of normalization you have mentioned.
> Below are a couple of reasons:
> 
> As stated before, my jobs table is nothing more than a plain LOG file
> which is stored in a database table instead of a regular file on the
> file system. The only reason why we needed to store it in the database
> is for easy search/filtering of records. If I were to use a plain log
> file, I don't think I would want to print a user ID (a number) to a
> log file which does not make much sense to the users reading it. I
> guess the same holds true regardless of the underlying data store
> (plain file or database table or something else).
> 
> About not really deleting a user record and having a "status" field in
> the user table, we did go through this and here are the issues with
> this approach:
> Let us say my table definition is -
> CREATE TABLE user(
>     user_id integer primary key,
>     user_name char(20),
>     full_name varchar(50));
> 
> We do want to have a unique constraint on the user_name field. If we
> do not purge the user record when a delete is requested, then I cannot
> have a new user with the same name as a user that was deleted 2 years
> ago. Same is true even if you create a composite unique key with
> user_name and status fields. Correct me if I'm wrong.
> 
> You said that -
> 
> > Based on this query, your index would never be used.
> 
> Why not and how do I make Derby to use my index in various scenarios
> I've mentioned.
> 
> Thanks.
> Sai Pullabhotla
> Phone: (402) 408-5753
> Fax: (402) 408-6861
> www.jMethods.com
> 
> 
> 
> 
> On Wed, Dec 24, 2008 at 4:10 PM,  <derby@segel.com> wrote:
> > Sai, please see my comments in yours...
> >
> >> -----Original Message-----
> >> From: Sai Pullabhotla [mailto:sai.pullabhotla@jmethods.com]
> >> Sent: Wednesday, December 24, 2008 1:54 PM
> >> To: Derby Discussion; msegel@segel.com
> >> Subject: Re: Need Advice on Indexes
> >>
> >> Rick and Mike,
> >>
> >> Thanks for taking time to reply to my question.
> >>
> >> As far as normalizing the table that Mike suggested -
> >>
> >> this jobs table is nothing more than a history table. So, a project or
> >> user exists today but may not exist tomorrow. However, we want to
> >> maintain the history of every project that was executed (this is
> >> considered a job and hence the name). Since, a user may not exist in a
> >> future time, I can not maintain a foreign key to my user table even
> >> though I've a user table and an ID column in it. Same with projects. I
> >> hope it makes sense now why this table is a flat table. However, I
> >> think I can move the remarks column into its own table and use a
> >> longvarchar or clob field.
> >>
> > Ok,
> > That's not a good design. Your user information should be stored
> separately
> > and have an id associated with it. You don't want to purge this
> information;
> > however you can associate a status field as part of this table. So you
> > retain and don't reuse the ids and you can still keep historical data
> > intact.
> >
> > One of the problems you may have is how you populate the field with the
> > employee's name. What happens if you have two employees with the same
> name?
> > (Yes it happens. ;-)
> > The other problem is what happens when the person entering the name
> > misspells the name? (Segel != Segal,Siegel,Siegal,Seagull) :-)
> >
> > In addition, your table isn't normalized when you clearly have a 1 to
> many
> > relationship of data. Conceptually there are multiple jobs to a project.
> So
> > when you store jobs and projects in the same data, its not normalized.
> >
> > Note: There are times when you break from using 3rd normal form, however
> it
> > requires a lot of practical experience to justify it.
> >
> >
> >> Now, as far as the queries go, Most of the times, it is -
> >>
> >> select  job_id, proj_id, proj_name, proj_folder, submit_user,
> >> run_user, queue_priority, run_priority, submit_time, start_time,
> >> end_time, time_taken, job_log, status from dpa_job where start_time>=?
> >> and start_time<? and status in(?, ?, ?, ?) order by start_time desc
> >>
> > Ok,
> > Not to nitpick, but I'm a bit old school and when writing a query I tend
> to
> > want to see it formatted to be easier to read. Taking your query...
> >
> > SELECT  job_id, proj_id, proj_name, proj_folder,
> >        submit_user, run_user, queue_priority,
> >        run_priority, submit_time, start_time, end_time,
> >        time_taken, job_log, status
> > FROM    dpa_job
> > WHERE   start_time>=?
> > AND     start_time<?
> > AND     status IN(?, ?, ?, ?)
> > ORDER BY start_time desc
> >
> > It's a bit easier to read. The key words are capitalized and you can see
> > your queries.
> >
> > Now looking at your common query example, the only thing you need to
> index
> > on is the start time column. You can add status to the query, however
> its
> > not going to really impact the performance of the query... unless you
> really
> > have a lot of rows. (Like more than 10,000 rows in the result set if you
> > ignore the status filter.)
> >
> > Does the IN clause effect the performance? There was an issue earlier in
> > Derby, but I think it has been fixed for a couple of versions. In your
> > example query with only 4 values, it is not that big of an issue.
> >
> > Based on this query, your index would never be used.
> >
> > You've indicated that this is a production database with customers using
> it.
> > I'm going to strongly suggest that your company hire a competent DBA.
> You've
> > said a couple of things that are major red flags.  When I say DBA, I
> mean
> > someone who could be considered a logical DBA and has both data modeling
> and
> > app development expertise. If not a FTE, get a consultant in to do a
> health
> > check.
> >
> >
> >
> >> Sai Pullabhotla
> >> Phone: (402) 408-5753
> >> Fax: (402) 408-6861
> >> www.jMethods.com
> >>
> >>
> >>
> >>
> >
> >
> >
> >



Mime
View raw message