db-derby-dev mailing list archives

From "Bergquist, Brett" <BBergqu...@canoga.com>
Subject RE: Could use some thoughts on a problem with a query in production
Date Fri, 09 Jun 2017 15:33:56 GMT
I agree, Bryan.  I am trying both routes: preparing some more detailed monitoring at the customer
site, and working on simulating the same access pattern in the hope of getting something
reproducible.

Some more insight.  This table has a high insert rate, about 10M records/day, and holds roughly
60M records.  It is indexed by a composite key that includes a timestamp, so the table has an
index in the correct order for querying.

The problem query should be accessing about 160K records in its result set.
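
For illustration only, the shape is something like the following sketch (the table, column,
and device names are made up here, not our actual schema, and the dates just match the
scenario described below):

    // Hypothetical sketch: a table with a composite key that includes a timestamp,
    // and a problem-style query that reads an hour-wide, day-old range through the index.
    import java.sql.*;

    public class ProblemQueryShape {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:derby:sampleDb;create=true")) {
                try (Statement s = conn.createStatement()) {
                    s.executeUpdate(
                        "CREATE TABLE measurement (" +
                        "  device_id INT NOT NULL," +
                        "  sample_ts TIMESTAMP NOT NULL," +
                        "  value DOUBLE," +
                        "  PRIMARY KEY (device_id, sample_ts))");  // composite key incl. timestamp
                }
                // Hour-wide range query over yesterday's data, resolved via the index,
                // returning on the order of 160K rows.
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT device_id, sample_ts, value FROM measurement " +
                        "WHERE device_id = ? AND sample_ts >= ? AND sample_ts < ? " +
                        "ORDER BY sample_ts")) {
                    ps.setInt(1, 42);
                    ps.setTimestamp(2, Timestamp.valueOf("2017-06-03 01:00:00"));
                    ps.setTimestamp(3, Timestamp.valueOf("2017-06-03 02:00:00"));
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) { /* consume rows */ }
                    }
                }
            }
        }
    }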

We have configured a large page cache (64000 pages).
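
For reference, that cache size is set through the standard derby.storage.pageCacheSize
property before the engine boots; a minimal sketch (the JDBC URL is a placeholder):

    // Minimal sketch: boot Derby with a 64000-page cache.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class LargePageCacheBoot {
        public static void main(String[] args) throws SQLException {
            // Must be set before the Derby engine boots; it can equally
            // be placed in derby.properties in the system directory.
            System.setProperty("derby.storage.pageCacheSize", "64000");
            try (Connection conn = DriverManager.getConnection("jdbc:derby:productionDb")) {
                System.out.println("Booted with a 64000-page cache");
            }
        }
    }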

In this scenario, the problem query runs on a Sunday morning around 1:45 AM and is querying for
Saturday’s data from 1:00 AM to 2:00 AM.  In the meantime, several other accesses have queried
this same table throughout Saturday and Sunday for more recent time intervals.  For example, one
access pattern looks at data every 5 minutes for the previous 15-minute interval.  So it seems to
me that the pages for those access patterns would already be in the cache.

Now along comes the problem query, which is much larger (1 hour’s worth of data) and much older
(yesterday’s data), so its data is probably not going to be in the cache, and it is going to
try to populate the cache with the pages it reads, triggering the “clear”.

The CPU utilization is a key clue, I believe, as it implies some sort of “busy” code.  It
almost looks like a “busy wait” being performed inside a loop.

Another key point is that the other queries hitting the same table, but for different (recent)
timespans, do not hit this problem even while the problem queries are tied up in knots.  This
leads me to believe it is not a statistics problem with the table.  Also, the stack traces show
no “bulk table scan” activity, but rather “index read” activity.

Another clue is that a copy of this database, taken while the condition was occurring, did not
show the issue.  The copy is obtained by “freezing” the database, performing a ZFS filesystem
snapshot of the database files, and then “unfreezing” the database.  Firing up this copy
means the database is “cold” and will not have cached anything.  The problem query works
fine on this copy of the database.
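
In outline, that freeze/snapshot/unfreeze sequence looks like this (the ZFS dataset name and
JDBC URL below are placeholders; the freeze and unfreeze calls are Derby’s standard
SYSCS_UTIL procedures):

    // Sketch of the freeze / ZFS snapshot / unfreeze copy procedure.
    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class FrozenSnapshot {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:derby:productionDb")) {
                // Block writes to the database files while the snapshot is taken.
                try (CallableStatement freeze =
                         conn.prepareCall("CALL SYSCS_UTIL.SYSCS_FREEZE_DATABASE()")) {
                    freeze.execute();
                }
                try {
                    // Point-in-time ZFS snapshot of the filesystem holding the database files.
                    new ProcessBuilder("zfs", "snapshot", "tank/derby@investigation")
                            .inheritIO().start().waitFor();
                } finally {
                    // Always unfreeze, even if the snapshot fails.
                    try (CallableStatement unfreeze =
                             conn.prepareCall("CALL SYSCS_UTIL.SYSCS_UNFREEZE_DATABASE()")) {
                        unfreeze.execute();
                    }
                }
            }
        }
    }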

The problem query is relatively new to the production system, and it is the only one querying
for relatively “old” data.  Again, this leads me down the path of something to do with the
cache.

I am just throwing out some thoughts with my detective hat on, to see if anything triggers
an idea.

So I have two plans of attack.

The first is to prepare a monitor script that can be executed around the time this issue happens
on the customer’s system, in the hope that it does happen again and the monitor script can
gather some more detailed information.
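
A rough sketch of the kind of monitor I have in mind, polling Derby’s SYSCS_DIAG diagnostic
tables once a minute and logging what they show (the JDBC URL, interval, and plain stdout
output are placeholders; OS-level data such as CPU and server-side thread dumps would come
from separate tools):

    // Rough sketch: record open transactions and locks every minute so there is
    // a history to examine if the slowdown recurs.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DerbyMonitor {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:derby://dbhost:1527/productionDb")) {
                while (true) {
                    System.out.println("=== sample at " + new java.util.Date() + " ===");
                    dump(conn, "SELECT * FROM SYSCS_DIAG.TRANSACTION_TABLE");
                    dump(conn, "SELECT * FROM SYSCS_DIAG.LOCK_TABLE");
                    Thread.sleep(60_000);
                }
            }
        }

        // Print every row of a diagnostic query as tab-separated columns.
        private static void dump(Connection conn, String sql) throws Exception {
            try (Statement s = conn.createStatement(); ResultSet rs = s.executeQuery(sql)) {
                int cols = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int i = 1; i <= cols; i++) {
                        row.append(rs.getString(i)).append('\t');
                    }
                    System.out.println(row);
                }
            }
        }
    }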

The second is to write an application that can replay a couple of days’ worth of queries
without any “think” time.  This should hit the database, from a query point of view, in the
same manner as what happens on the customer’s system.  Hopefully this will trigger a
reproducible problem.
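
A rough sketch of that replay idea, reading previously captured statements from a file and
firing them back-to-back (the file name, its one-statement-per-line format, and the JDBC URL
are assumptions):

    // Rough sketch: replay captured SELECT statements with no think time so the
    // cache sees the same access pattern compressed in time.
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.List;

    public class QueryReplay {
        public static void main(String[] args) throws Exception {
            // Assumed format: one SELECT statement per line.
            List<String> queries = Files.readAllLines(Paths.get("captured-queries.sql"));
            try (Connection conn = DriverManager.getConnection("jdbc:derby:replayDb");
                 Statement stmt = conn.createStatement()) {
                for (String sql : queries) {
                    if (sql.trim().isEmpty()) continue;
                    long start = System.nanoTime();
                    try (ResultSet rs = stmt.executeQuery(sql)) {
                        while (rs.next()) { /* drain the result set */ }
                    }
                    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                    System.out.println(elapsedMs + " ms\t" + sql);
                }
            }
        }
    }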

From: Bryan Pendleton [mailto:bpendleton.derby@gmail.com]
Sent: Wednesday, June 07, 2017 9:47 AM
To: derby-dev@db.apache.org
Subject: Re: Could use some thoughts on a problem with a query in production

Hi Brett,

Cache cleaning in general is a very I/O intensive activity, so I agree that it is odd that
your system appeared to be CPU busy during that time.

It's interesting that whatever it was, persisted for such a long time.

In the past, I have developed small operator scripts that collect system information; they can
be run by a simple monitoring tool and keep a history of the observations they make.

If you were to build and install such a monitoring system, you might be able to develop more
clues about the "larger picture" of activity on your machine during that time period.

Sorry to be so vague and non-specific, but I find that retrospective analysis of problems
like this is VERY challenging.

Often, the best you can do is:
1) Try to figure out some way to make the problem happen on demand, so you can cause it and
observe it.
2) Instrument everything, and ensure you are preserving a history of your instrumentation
recordings, so that you can have a mountain of detail when those rare events occur.

It strikes me as being like high-energy physics, where your experiment generates volumes of
data, and it takes weeks or months of analysis afterward to figure out what actually occurred.

Not that I'm much good at high-energy physics, either, I'm afraid. :)

bryan


________________________________
Canoga Perkins
20600 Prairie Street
Chatsworth, CA 91311
(818) 718-6300
