db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sunitha Kambhampati <ksunitha...@gmail.com>
Subject Re: [jira] Created: (DERBY-241) Encrypted run of stress.multi test failed once with a boot error with ibm142
Date Wed, 27 Apr 2005 00:01:36 GMT
Thanks Mike for your input.  I have some more information on this issue 
, and it seems to point to an encryption & interrupt problem..

Here is what I did :
Below, to be specific I am going to refer to the od -c (character dump ) 
report for the database described in jira 241 as the od-c_jira241_c180.dat.

1) Created an encrypted database in ij and checked the od -c c180.dat  
and the output is different from that in the od-c_jira241_c180.dat. The 
characters look encrypted, you cant make out any plain text characters 
from the dump.
2) Created a unencrypted database and checked the od -c c180.dat and 
that shows similar output as in the od -c_jira241_c180.dat.

from 1 , 2 , it seems certain that this table should have been encrypted 
but for some reason it didnt .. maybe because of the interrupts.

3) To find out the system catalog table, the conglomerate number was 
384, that mapped to c180.dat. Issuing the below query  shows the table as
ij> select tablename from sys.sysconglomerates a , sys.systables b where 
a.tableid=b.tableid and conglomeratenumber=384;

1 row selected

4) The data in SYSALIASES does seem to match to what is in the 

5) Regarding no log records for page 5, (0,384) with page version 1 etc 
is ok . This seems to be because create database flushes the data pages 
to disk so no log here is fine. Doubleverified by doing the following   
- I tried to do a dump of the logs on a fresh unencrypted database using 
derby and the log records for pageversion 1, 2 are missing in this case 

The stress test creates 2 functions, which will be 2 inserts into 
sysaliases.  I created the two functions on the new unencrypted database 
and the  derby.log log record dump shows 2 insert operations and a space 
operation for the page (5,(0,384)) which is inline with what I found 
earlier for the jira241 case.

So now as it seems like this problem  has to do with encryption, next 
step seems to be to look in code for cases to see why encryption didnt 
happen here.


Mike Matrigali wrote:

> some next steps:
> 1) figure out what catalog the problem is in.  Since it is likely a 
> system catalog, creating a new database will result in the same 
> containerid being used for the same table.
> 2) interested in result of checking if system tables are encrypted, if 
> they are supposed to be then finding the clear text disk data file is
> the problem.  This would point to an encryption bug, possibly JVM, 
> possibly related to interrupts.  If trying to reproduce probably should
> keep the JVM and encryption algorithm same as the failure case.
> 3) probably reasonable to create an encrypted database and do the log 
> dump, just to verify that clean db also does not have log records for 
> those first updates during database creation.  Note I think the amount 
> of stuff done under no logging conditions was recently increased by 
> Dan, this may be different between the trunk and the released copy.
> 4) These problems are hard to diagnose without a reproducible case, If 
> anyone else see's databases not booting fine in the stress test please 
> post the information - sometimes just getting one more case makes a 
> pattern obvious.  Note that some errors reported from the stress tests 
> are just problems of using interupts for shutdown, and are not 
> recovery issues.  The system gets into a state where the logging 
> system has been
> told to shutdown so it basically fails all subsequent requests, which
> can show up as various errors in the log.  The errors I am most 
> concerned is if the subsequent sanity check of the database in the 
> stress test does not boot, there is no non-bug situation currently 
> where the system should not boot after any type of shutdown (orderly, 
> thread kills, jvm kills, or machine crash).
> As you probably saw the stress tests set the flag so that all log files
> are saved, even those not needed for recovery.  This makes it a little 
> easier to get a history from the store perspective on what has happened.
> In this case it looks like there is only one log file, on other 
> machines it may be more.
> Sunitha Kambhampati (JIRA) wrote:
>> Encrypted run of stress.multi test failed once with a boot error with 
>> ibm142
>> ---------------------------------------------------------------------------- 
>>          Key: DERBY-241
>>          URL: http://issues.apache.org/jira/browse/DERBY-241
>>      Project: Derby
>>         Type: Bug
>>   Components: Store      Versions:     Environment: ibm142, 
>> machine is a dell, 1cpu, 256MB RAM, ~497Mhz, has an IDE disk and has 
>> write cache enabled.
>>     Reporter: Sunitha Kambhampati
>> The stress.multi test failed for encryption run with ibm142 on  the 
>> following kind of machine once when running derbyall suite but have 
>> not been able to reproduce it since then.
>> The machine on which it failed is a  - dell, 1cpu, 256MB RAM, 
>> ~497Mhz, has an IDE disk and has write cache enabled. As far as I can 
>> tell, the machine was up and running ok when the tests were running.
>> Looking at the test directory for the stress.multi test, the 
>> derby.log seems to have a lot of interrupts and looking at the errors 
>> shows the following boot error.
>>  Booting Derby version The Apache Software Foundation - Apache Derby 
>> - alpha - (31132): instance 
>> c013800d-0103-64b3-44ec-ffffa1f4cf33
>> on database directory 
>> E:\classtest\JarResults.2005-04-20\ibm142_derbyall\derbyall\encryptionAll\encryption\multi\stress\mydb

>> ERROR XSLA7: Cannot redo operation Page Operation: 
>> Page(5,Container(0, 384)) pageVersion 3 : Insert :  Slot=2 recordId=8 
>> in the log.
>> Here are some of my notes in trying to debug this:
>> 0) Copied the problematic database to a safe location and used sane 
>> jars for debugging.
>> 1) Tried to boot the database using ij , and with the following debug 
>> property set  - derby.debug.true=DumpLogOnly, this dumped all the log 
>> records into derby.log.  Then searching for log records for the 
>> container(0,384) - found only 3 log records pertaining to it. 
>> there is one for create container and 2 records for insert. Space 
>> Operation for create container ( 0,384)
>> Page operation for (Page 5, Container(0,384)), version 3 , involving 
>> an insert at slot 2, record 8. Page operation for  version 4, 
>> involving insert at slot 3, record 9. 
>> => There were no initPage operation for this page or any records 
>> pertaining to pageversion 1,2.   This means that log records were 
>> missing, but the only case this would be ok  is if it was a system 
>> catalog table.  Since in case of create database, we flush the data 
>> pages to disk itself,  so no logs in this case  is OK.
>> 2)Next step - tried to verify if it was a system catalog table. 
>> Looking in the org.apache.derby.impl.store.access.RAMAccessManager, 
>> getNextConglomId(),  the container key - 384 maps to 18th id.   One 
>> way I verified it was I created another empty database and saw if 
>> this table existed c180.dat and it did.. which is right that is a 
>> system catalog table.
>> 3) To find the actual cause of the redo exception, I put in 
>> printstack traces in the code, and putting in the debugger  -  the 
>> error printed was
>> ERROR XSDB1: Unknown page format at page Page(5,Container(0, 384))
>> It seemed like the page format was messed up.  I put printlns to get 
>> hte page format id ( in CachedPage, setIdentity)  and tried to dump 
>> the contents of the page. The checksum validation actually would have 
>> happened if all was ok with the format id but since here the format 
>> id was messed up, this error is thrown instead of a checksum error.
>> 4) There is a od facility in MKS that dumps the contents in hex and 
>> character format.  This table mapped to the 18th id, and that is the 
>> c180.dat in seg0 directory.  Doing a dump  od -c c180.dat shows stuff 
>> like this :  S   Y   S   C   S   _   B   A   C
>> 0000034040     K   U   P   _   D   A   T   A   B   A   S   E   _   
>> A   N   D
>> 0000034060     _   E   N   A   B   L   E   _   L   O   G   _   A   
>> R   C   H
>> 0000034100     I   V   E   _   M   O   D   E  \
>> These seem to be system catalog procedure names, and it seems weird 
>> that it would not be encrypted.     Need to verify if system catalogs 
>> are encrypted, if so then this probably is a interrupt problem with 
>> encryption.

View raw message