db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: [jira] Created: (DERBY-241) Encrypted run of stress.multi test failed once with a boot error with ibm142
Date Wed, 27 Apr 2005 15:29:14 GMT
Any idea if the whole db is not encrypted, or just that table?  This
is sort of hard to tell from the outside.  A couple of not perfect
suggestions are:

o od -c all the files and do a quick search to see if there is any plain
   text visible.
o Somehow hack properties so that the db does not think it is encrypted
   anymore (either through a debugger or somehow hand editing the binary
   property file).

Sunitha Kambhampati wrote:
> Thanks Mike for your input.  I have some more information on this issue 
> , and it seems to point to an encryption & interrupt problem..
> Here is what I did :
> Below, to be specific I am going to refer to the od -c (character dump ) 
> report for the database described in jira 241 as the od-c_jira241_c180.dat.
> 1) Created an encrypted database in ij and checked the od -c c180.dat  
> and the output is different from that in the od-c_jira241_c180.dat. The 
> characters look encrypted, you cant make out any plain text characters 
> from the dump.
> 2) Created a unencrypted database and checked the od -c c180.dat and 
> that shows similar output as in the od -c_jira241_c180.dat.
> from 1 , 2 , it seems certain that this table should have been encrypted 
> but for some reason it didnt .. maybe because of the interrupts.
> 3) To find out the system catalog table, the conglomerate number was 
> 384, that mapped to c180.dat. Issuing the below query  shows the table as
> ij> select tablename from sys.sysconglomerates a , sys.systables b where 
> a.tableid=b.tableid and conglomeratenumber=384;
> --------------------------------------------------------------------------------------------------------------------------------

> 1 row selected
> ij>
> 4) The data in SYSALIASES does seem to match to what is in the 
> od-c_jira241_c180.dat.
> 5) Regarding no log records for page 5, (0,384) with page version 1 etc 
> is ok . This seems to be because create database flushes the data pages 
> to disk so no log here is fine. Doubleverified by doing the following   
> - I tried to do a dump of the logs on a fresh unencrypted database using 
> derby and the log records for pageversion 1, 2 are missing in this case 
> too.
> The stress test creates 2 functions, which will be 2 inserts into 
> sysaliases.  I created the two functions on the new unencrypted database 
> and the  derby.log log record dump shows 2 insert operations and a space 
> operation for the page (5,(0,384)) which is inline with what I found 
> earlier for the jira241 case.
> So now as it seems like this problem  has to do with encryption, next 
> step seems to be to look in code for cases to see why encryption didnt 
> happen here.
> Sunitha.
> Mike Matrigali wrote:
>> some next steps:
>> 1) figure out what catalog the problem is in.  Since it is likely a 
>> system catalog, creating a new database will result in the same 
>> containerid being used for the same table.
>> 2) interested in result of checking if system tables are encrypted, if 
>> they are supposed to be then finding the clear text disk data file is
>> the problem.  This would point to an encryption bug, possibly JVM, 
>> possibly related to interrupts.  If trying to reproduce probably should
>> keep the JVM and encryption algorithm same as the failure case.
>> 3) probably reasonable to create an encrypted database and do the log 
>> dump, just to verify that clean db also does not have log records for 
>> those first updates during database creation.  Note I think the amount 
>> of stuff done under no logging conditions was recently increased by 
>> Dan, this may be different between the trunk and the released copy.
>> 4) These problems are hard to diagnose without a reproducible case, If 
>> anyone else see's databases not booting fine in the stress test please 
>> post the information - sometimes just getting one more case makes a 
>> pattern obvious.  Note that some errors reported from the stress tests 
>> are just problems of using interupts for shutdown, and are not 
>> recovery issues.  The system gets into a state where the logging 
>> system has been
>> told to shutdown so it basically fails all subsequent requests, which
>> can show up as various errors in the log.  The errors I am most 
>> concerned is if the subsequent sanity check of the database in the 
>> stress test does not boot, there is no non-bug situation currently 
>> where the system should not boot after any type of shutdown (orderly, 
>> thread kills, jvm kills, or machine crash).
>> As you probably saw the stress tests set the flag so that all log files
>> are saved, even those not needed for recovery.  This makes it a little 
>> easier to get a history from the store perspective on what has happened.
>> In this case it looks like there is only one log file, on other 
>> machines it may be more.
>> Sunitha Kambhampati (JIRA) wrote:
>>> Encrypted run of stress.multi test failed once with a boot error with 
>>> ibm142
>>> ----------------------------------------------------------------------------

>>>          Key: DERBY-241
>>>          URL: http://issues.apache.org/jira/browse/DERBY-241
>>>      Project: Derby
>>>         Type: Bug
>>>   Components: Store      Versions:     Environment: ibm142, 
>>> machine is a dell, 1cpu, 256MB RAM, ~497Mhz, has an IDE disk and has 
>>> write cache enabled.
>>>     Reporter: Sunitha Kambhampati
>>> The stress.multi test failed for encryption run with ibm142 on  the 
>>> following kind of machine once when running derbyall suite but have 
>>> not been able to reproduce it since then.
>>> The machine on which it failed is a  - dell, 1cpu, 256MB RAM, 
>>> ~497Mhz, has an IDE disk and has write cache enabled. As far as I can 
>>> tell, the machine was up and running ok when the tests were running.
>>> Looking at the test directory for the stress.multi test, the 
>>> derby.log seems to have a lot of interrupts and looking at the errors 
>>> shows the following boot error.
>>>  Booting Derby version The Apache Software Foundation - Apache Derby 
>>> - alpha - (31132): instance 
>>> c013800d-0103-64b3-44ec-ffffa1f4cf33
>>> on database directory 
>>> E:\classtest\JarResults.2005-04-20\ibm142_derbyall\derbyall\encryptionAll\encryption\multi\stress\mydb

>>> ERROR XSLA7: Cannot redo operation Page Operation: 
>>> Page(5,Container(0, 384)) pageVersion 3 : Insert :  Slot=2 recordId=8 
>>> in the log.
>>> Here are some of my notes in trying to debug this:
>>> 0) Copied the problematic database to a safe location and used sane 
>>> jars for debugging.
>>> 1) Tried to boot the database using ij , and with the following debug 
>>> property set  - derby.debug.true=DumpLogOnly, this dumped all the log 
>>> records into derby.log.  Then searching for log records for the 
>>> container(0,384) - found only 3 log records pertaining to it. there 
>>> is one for create container and 2 records for insert. Space Operation 
>>> for create container ( 0,384)
>>> Page operation for (Page 5, Container(0,384)), version 3 , involving 
>>> an insert at slot 2, record 8. Page operation for  version 4, 
>>> involving insert at slot 3, record 9. => There were no initPage 
>>> operation for this page or any records pertaining to pageversion 
>>> 1,2.   This means that log records were missing, but the only case 
>>> this would be ok  is if it was a system catalog table.  Since in case 
>>> of create database, we flush the data pages to disk itself,  so no 
>>> logs in this case  is OK.
>>> 2)Next step - tried to verify if it was a system catalog table. 
>>> Looking in the org.apache.derby.impl.store.access.RAMAccessManager, 
>>> getNextConglomId(),  the container key - 384 maps to 18th id.   One 
>>> way I verified it was I created another empty database and saw if 
>>> this table existed c180.dat and it did.. which is right that is a 
>>> system catalog table.
>>> 3) To find the actual cause of the redo exception, I put in 
>>> printstack traces in the code, and putting in the debugger  -  the 
>>> error printed was
>>> ERROR XSDB1: Unknown page format at page Page(5,Container(0, 384))
>>> It seemed like the page format was messed up.  I put printlns to get 
>>> hte page format id ( in CachedPage, setIdentity)  and tried to dump 
>>> the contents of the page. The checksum validation actually would have 
>>> happened if all was ok with the format id but since here the format 
>>> id was messed up, this error is thrown instead of a checksum error.
>>> 4) There is a od facility in MKS that dumps the contents in hex and 
>>> character format.  This table mapped to the 18th id, and that is the 
>>> c180.dat in seg0 directory.  Doing a dump  od -c c180.dat shows stuff 
>>> like this :  S   Y   S   C   S   _   B   A   C
>>> 0000034040     K   U   P   _   D   A   T   A   B   A   S   E   _   
>>> A   N   D
>>> 0000034060     _   E   N   A   B   L   E   _   L   O   G   _   A   
>>> R   C   H
>>> 0000034100     I   V   E   _   M   O   D   E  \
>>> These seem to be system catalog procedure names, and it seems weird 
>>> that it would not be encrypted.     Need to verify if system catalogs 
>>> are encrypted, if so then this probably is a interrupt problem with 
>>> encryption.

View raw message