db-derby-user mailing list archives

From "Sai Pullabhotla" <sai.pullabho...@jmethods.com>
Subject Re: Need Advice on Indexes
Date Wed, 24 Dec 2008 19:54:22 GMT
Rick and Mike,

Thanks for taking time to reply to my question.

As far as normalizing the table, which Mike suggested -

this jobs table is really just a history table. A project or user that
exists today may not exist tomorrow, but we want to keep the history of
every project that was executed (each execution is considered a job,
hence the name). Since a user may not exist at some future time, I
cannot maintain a foreign key to my user table, even though I have a
user table with an ID column in it. The same goes for projects. I hope
it is clear now why this table is flat. However, I think I can move the
remarks column into its own table and use a longvarchar or clob field.
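
If I do split the remarks out, here is a rough sketch of what I have in
mind (the table name is just a placeholder, not something we have today):

```sql
-- side table for the wide, rarely-read remarks column, keyed by job
create table dpa_job_remarks (
    job_id bigint not null primary key,
    remarks clob
);
```

That way the main table rows stay narrow for the common searches, and the
remarks are only fetched when a user drills into a single job.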

Now, as far as the queries go, most of the time it is:

select job_id, proj_id, proj_name, proj_folder, submit_user,
  run_user, queue_priority, run_priority, submit_time, start_time,
  end_time, time_taken, job_log, status
from dpa_job
where start_time >= ? and start_time < ? and status in (?, ?, ?, ?)
order by start_time desc

As stated in my original email, the date range will always be there. I
can drop the "and status in (?, ?, ?, ?)" clause if the user does not
care about the job's status, but if the user selects one or more status
types, that IN clause needs to stay. I am not sure whether, or how, that
IN affects performance.
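
If it helps the discussion, here is the kind of index I was considering
for this query. This is just a sketch (the index name is made up), based
on the idea that the always-present date range should lead:

```sql
-- start_time leads because the date range is in every query;
-- status follows so the in (...) restriction can also be
-- evaluated from the index entries
create index idx_job_start_status on dpa_job (start_time, status);
```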

Another variation of the query (say the user fills in every parameter on
the search screen) looks like:

select job_id, proj_id, proj_name, proj_folder, submit_user,
  run_user, queue_priority, run_priority, submit_time, start_time,
  end_time, time_taken, job_log, status
from dpa_job
where start_time >= ? and start_time < ? and status in (?, ?)
  and run_user = ? and proj_folder = ? and proj_name = ?
order by start_time desc

The order by clause will use a different column if the user chooses a
custom sort, but it is always a single column.

Another query we do is when a user wants to see the complete execution
history of a project (the project name could have changed since its
creation too). That's why we maintain the project ID in the job table.
This query looks like:

select job_id, proj_id, proj_name, proj_folder, submit_user,
  run_user, queue_priority, run_priority, submit_time, start_time,
  end_time, time_taken, job_log, status
from dpa_job
where proj_id = ? and start_time >= ? and start_time < ?
  and status in (?, ?, ?, ?)
order by start_time desc
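
For this project-history case I was thinking of something along these
lines (again just a sketch, with a made-up index name):

```sql
-- proj_id is an equality match, so it leads; start_time then
-- narrows the date range within that one project
create index idx_job_proj_start on dpa_job (proj_id, start_time);
```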

And finally, for the quick stats (showing the job count for today,
yesterday, last week, etc.), here is the statement:

select count(job_id) from dpa_job
where start_time >= ? and start_time < ? and status = ?

This statement shows how many jobs were successful, failed, or canceled
in the given date range.

The main reason I posted this question is the last query (the quick
stats): one of our customers complained that it takes too long, 5 to 10
minutes, with about 500,000 records. I did not notice my own system
taking that long when I tested with 300,000+ records.
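
When I get a chance to reproduce the slowdown, I plan to look at the
query plan using Derby's runtime statistics, roughly like this from ij
(with sample literal values standing in for the ? parameters):

```sql
-- enable runtime statistics for this connection
call syscs_util.syscs_set_runtimestatistics(1);

-- run the slow query with literals instead of the ? markers
select count(job_id) from dpa_job
where start_time >= timestamp('2008-12-01 00:00:00')
  and start_time < timestamp('2008-12-24 00:00:00')
  and status = 'S';

-- print the plan to see whether an index scan or a table scan was used
values syscs_util.syscs_get_runtimestatistics();
```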

Hope I've provided enough information for you. Let me know if you need
additional information.

Your help is greatly appreciated.

Sai Pullabhotla
Phone: (402) 408-5753
Fax: (402) 408-6861
www.jMethods.com




On Wed, Dec 24, 2008 at 11:57 AM,  <derby@segel.com> wrote:
> Actually it's not all that hard. :-)
>
>
> <soap box> Too many people are acting like 'physical' DBAs without
> understanding how a relational database or index works. Are there any good
> online tutorials which will 'teach' the basics? </soap box>
>
> <caveat> What I am writing is a generic statement or guideline to
> table/index design in general. It is not specific to derby/cloudscape/javadb
> and should be taken with a grain of salt. I am sure that there will be
> someone who disagrees and will tell you something different. As always YMMV.
> </caveat>
>
> Just on the surface a couple of things ...
>
> First, even without knowing your application or the rest of your database, I
> will say that you first need to normalize your data. Performance wise, this
> is pretty important.
>
> You're mixing project information and job information in one table.
> Also, a hint: it's more efficient to store a user_id than the entire
> user name. (At most a 4-byte integer, vs. a varchar of 1-4 bytes per
> character [Unicode UTF-8 character set for internationalization].) Even
> in plain ASCII, 'John Smith' on every row takes up more space than
> employee_id 112. And since you're tracking both the submit user and the
> run user, that's twice the savings per row. ;-)
> So you'll want to create look up tables and create more of a star schema
> than just your flat table.
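>
> To make that concrete, here's a rough sketch (table and column names
> are placeholders, not something from your schema):
>
> ```sql
> -- look-up table: store each user name once
> create table app_user (
>     user_id integer not null primary key,
>     user_name varchar(20) not null
> );
>
> -- the job table then carries only the small integer ids
> -- (deliberately no foreign key, since job history must outlive users)
> alter table job add column submit_user_id integer;
> alter table job add column run_user_id integer;
> ```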
>
> Second, your index is almost as large as your table. You're creating an
> index on what looks like 10 of the 16 columns? (I probably miscounted.)
> This isn't a good idea. You'll see this sort of thing done because some
> engines (maybe all) will only search the index if all of the columns of the
> query are also in the index, so the idea is that you'll save yourself some
> seek times.
>
> Just a rule of thumb: in relational databases you really don't want to
> make a habit of joining more than 3 or 4 tables together. With respect
> to indexes, once you get beyond 3 or 4 columns you're not going to get
> many more unique values. The key is to use the index to limit your
> search; once you get down to a reasonable subset, the database will do
> a sequential scan of that subset anyway. In fact, another rule of thumb
> is that if your table has fewer than 10,000 rows, most engines will not
> use an index at all.
>
> Another issue: large indexes will kill your performance on inserts and
> updates. (When you add or change a row in the table, the index has to
> be updated as well.) So unless you have a situation with lots of reads
> and few updates/inserts, you're going to have performance issues.
>
> Also, in general, when you do a query on a table, most databases will only
> use one index. (Some databases like XPS or other data warehouse centric
> databases will use multiple indexes on the data.)
> The other big issue is that if your query is sorting on the 3rd or 4th
> column of the index, most likely the optimizer will not use the index.
> Using your index, that means if you have a statement like:
>
> SELECT *
> FROM   job
> WHERE  submit_user = 'JOHN DOE'
>
> the database will not use the index but will do a sequential table scan.
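>
> A rough sketch of what splitting it up might look like (the index
> names are made up):
>
> ```sql
> -- several narrow indexes, one per common access path,
> -- instead of one index covering nearly every column
> create index idx_job_submit_user on job (submit_user);
> create index idx_job_sched on job (sched_job_id);
> ```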
>
> In looking at your example, it's easy to see that you could split this
> into a couple of different indexes.
>
> As Rick points out, without knowing your queries, it's hard to say what
> indexes you'll need.
>
> First, split the table to normalize the data.
> Figure out the indexes needed for joining the table and also how you're
> going to query for a job.
>
> In addition, if you're really looking for the best performance, think
> about how often you'll actually look at each column. You have a lot of
> varchar columns. Are they going to be long enough? What's the minimum
> you're going to be storing in the actual table? (The minimum length
> usually gets stored in the table while the rest is referenced from a
> different tablespace/page.) If this table is constantly being used, but
> you're not actually looking at the remarks or comments (large varchar
> columns), you'll want to think about putting them in a separate table,
> or even converting them to a text blob so that you don't have an
> artificial limit. (Assuming that you'll want or need more than 512
> characters in a comment.) Table width is still a factor, although with
> each generation of advancements in hard drives, memory, and I/O it
> becomes less and less of an issue.
>
> Separating out the varchars is really necessary if you're doing a lot
> of lookups on the non-varchar columns and need to squeeze the most out
> of your performance.
>
> Does this help?
>
> BTW, I think it's great that you're thinking about compound indexes
> (indexes using multiple columns). Too often developers only think about
> a single column, or the framework only creates single-column indexes
> for you. Then you end up with a lot of indexes, and when you create
> your own compound indexes you may need to add optimizer hints to your
> query statements. (If your database supports optimizer hints.)
>
> -Mike
>
> PS. Sorry if this seems a bit disjointed. I'm trying to write this as my
> wife is bossing me around the house. :-)
>
> Happy Holidays!
>
>
>> -----Original Message-----
>> From: Richard.Hillegas@Sun.COM [mailto:Richard.Hillegas@Sun.COM]
>> Sent: Wednesday, December 24, 2008 7:50 AM
>> To: Derby Discussion
>> Subject: Re: Need Advice on Indexes
>>
>> Hi Sai,
>>
>> It's hard to answer your question without seeing your queries. However,
>> for an index to be useful, some subset of the restricted columns in your
>> query must form a leading prefix of the indexed columns. So for
>> instance, your index would not be useful for the following query:
>>
>>   select * from job where proj_name  = 'Plan 9'
>>
>> because the leading column of the index, proj_id, is not restricted by
>> the query. If the timestamp ranges are always necessary in your queries,
>> then I would make them the leading columns in the index.
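>>
>> For example, something along these lines (just a sketch, with a
>> made-up index name):
>>
>> ```sql
>> -- start_time leads, so the always-present timestamp range can use
>> -- the index; status follows for the common IN restriction
>> create index idx_job_by_time on job (start_time, status);
>> ```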
>>
>> Hope this helps,
>> -Rick
>>
>> Sai Pullabhotla wrote:
>> > Dear Developers,
>> >
>> > I currently have the following table and index in a Derby database.
>> >
>> > create table job(
>> >     job_id bigint not null,
>> >     proj_id integer not null,
>> >     proj_name varchar(50) not null,
>> >     proj_folder varchar(256) not null,
>> >     submit_user varchar(20) not null,
>> >     run_user varchar(20) not null,
>> >     queue_priority integer not null,
>> >     run_priority integer not null,
>> >     submit_time timestamp not null,
>> >     start_time timestamp not null,
>> >     end_time timestamp,
>> >     time_taken bigint,
>> >     job_log varchar(256) not null,
>> >     status char(1) default 'A' not null,
>> >     remarks varchar(512),
>> >     sched_job_id integer,
>> >
>> >     primary key(job_id),
>> >     check(status in('A', 'S', 'F', 'C'))
>> > );
>> >
>> > create index idx_job on job (proj_id desc, proj_name desc, proj_folder
>> > desc, start_time desc, end_time desc, run_user desc, time_taken desc,
>> > status desc, sched_job_id desc);
>> >
>> > This table would typically have a few hundred thousand records
>> > (~500,000). Records should be searchable by a combination of date
>> > range (start_time), project name, project folder, run user, status
>> > (one or more status codes using IN clause). To further clarify this
>> > search criteria, date range is always required and the rest of the
>> > parameters are optional.
>> >
>> > There is also a need to search records by job_id alone as well as
>> > sched_job_id alone. In the first case, there will be at most one
>> > result, but in the second case there could be several.
>> >
>> > The search results should be sortable pretty much by any column except
>> > remarks and job log.
>> >
>> > We also have a need to display the job count based on the status code
>> > for today, yesterday, last week and last month. In other words, the
>> > last week statistics would show how many jobs were successful, how
>> > many were failed etc. in the past one week.
>> >
>> > I'm wondering whether the above index is good enough to perform all
>> > of these functions fast enough. Please advise.
>> >
>> > Regards,
>> >
>> > Sai Pullabhotla
>> > Phone: (402) 408-5753
>> > Fax: (402) 408-6861
>> > www.jMethods.com
>> >
>
>
>
>
