Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Derby Development" <derby-dev@db.apache.org>
Received-SPF: neutral (hermes.apache.org: 32.97.182.142 is neither permitted
 nor denied by domain of suresh.thalamati@gmail.com)
Message-ID: <426FCF70.1020802@gmail.com>
Date: Wed, 27 Apr 2005 10:44:16 -0700
From: Suresh Thalamati <suresh.thalamati@gmail.com>
User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206)
MIME-Version: 1.0
To: Derby Development <derby-dev@db.apache.org>
Subject: Re: [jira] Created: (DERBY-241) Encrypted run of stress.multi test
 failed once with a boot error with ibm142
References: <986130498.1114544723587.JavaMail.jira@ajax.apache.org>
 <426E9EA3.6010204@sbcglobal.net> <426ED660.8010700@gmail.com>
 <426FAFCA.60505@sbcglobal.net>
In-Reply-To: <426FAFCA.60505@sbcglobal.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Another thing I would  do is to skip the Log record that it is failing 
on currently through the debugger/by adding skip code.
That way you can  find some other container that might give better 
clues, ofcourse not always :-) 

Thanks
-suresht


Mike Matrigali wrote:

> Any idea if the whole db is not encrypted, or just that table?  This
> is sort of hard to tell from the outside.  A couple of not perfect
> suggestions are:
>
> o od -c all the files and do a quick search to see if there is any plain
>   text visible.
> o Somehow hack properties so that the db does not think it is encrypted
>   anymore (either through a debugger or somehow hand editing the binary
>   property file).
>
> Sunitha Kambhampati wrote:
>
>> Thanks Mike for your input.  I have some more information on this 
>> issue , and it seems to point to an encryption & interrupt problem..
>>
>> Here is what I did :
>> Below, to be specific I am going to refer to the od -c (character 
>> dump ) report for the database described in jira 241 as the 
>> od-c_jira241_c180.dat.
>>
>> 1) Created an encrypted database in ij and checked the od -c 
>> c180.dat  and the output is different from that in the 
>> od-c_jira241_c180.dat. The characters look encrypted, you cant make 
>> out any plain text characters from the dump.
>> 2) Created a unencrypted database and checked the od -c c180.dat and 
>> that shows similar output as in the od -c_jira241_c180.dat.
>>
>> from 1 , 2 , it seems certain that this table should have been 
>> encrypted but for some reason it didnt .. maybe because of the 
>> interrupts.
>>
>> 3) To find out the system catalog table, the conglomerate number was 
>> 384, that mapped to c180.dat. Issuing the below query  shows the 
>> table as
>> SYSALIASES.
>> ij> select tablename from sys.sysconglomerates a , sys.systables b 
>> where a.tableid=b.tableid and conglomeratenumber=384;
>> TABLENAME
>> -------------------------------------------------------------------------------------------------------------------------------- 
>>
>> SYSALIASES
>>
>> 1 row selected
>> ij>
>>
>> 4) The data in SYSALIASES does seem to match to what is in the 
>> od-c_jira241_c180.dat.
>> 5) Regarding no log records for page 5, (0,384) with page version 1 
>> etc is ok . This seems to be because create database flushes the data 
>> pages to disk so no log here is fine. Doubleverified by doing the 
>> following   - I tried to do a dump of the logs on a fresh unencrypted 
>> database using derby and the log records for pageversion 1, 2 are 
>> missing in this case too.
>>
>> The stress test creates 2 functions, which will be 2 inserts into 
>> sysaliases.  I created the two functions on the new unencrypted 
>> database and the  derby.log log record dump shows 2 insert operations 
>> and a space operation for the page (5,(0,384)) which is inline with 
>> what I found earlier for the jira241 case.
>>
>> So now as it seems like this problem  has to do with encryption, next 
>> step seems to be to look in code for cases to see why encryption 
>> didnt happen here.
>>
>> Sunitha.
>>
>> Mike Matrigali wrote:
>>
>>> some next steps:
>>>
>>> 1) figure out what catalog the problem is in.  Since it is likely a 
>>> system catalog, creating a new database will result in the same 
>>> containerid being used for the same table.
>>>
>>> 2) interested in result of checking if system tables are encrypted, 
>>> if they are supposed to be then finding the clear text disk data 
>>> file is
>>> the problem.  This would point to an encryption bug, possibly JVM, 
>>> possibly related to interrupts.  If trying to reproduce probably should
>>> keep the JVM and encryption algorithm same as the failure case.
>>>
>>> 3) probably reasonable to create an encrypted database and do the 
>>> log dump, just to verify that clean db also does not have log 
>>> records for those first updates during database creation.  Note I 
>>> think the amount of stuff done under no logging conditions was 
>>> recently increased by Dan, this may be different between the trunk 
>>> and the released copy.
>>>
>>> 4) These problems are hard to diagnose without a reproducible case, 
>>> If anyone else see's databases not booting fine in the stress test 
>>> please post the information - sometimes just getting one more case 
>>> makes a pattern obvious.  Note that some errors reported from the 
>>> stress tests are just problems of using interupts for shutdown, and 
>>> are not recovery issues.  The system gets into a state where the 
>>> logging system has been
>>> told to shutdown so it basically fails all subsequent requests, which
>>> can show up as various errors in the log.  The errors I am most 
>>> concerned is if the subsequent sanity check of the database in the 
>>> stress test does not boot, there is no non-bug situation currently 
>>> where the system should not boot after any type of shutdown 
>>> (orderly, thread kills, jvm kills, or machine crash).
>>>
>>> As you probably saw the stress tests set the flag so that all log files
>>> are saved, even those not needed for recovery.  This makes it a 
>>> little easier to get a history from the store perspective on what 
>>> has happened.
>>> In this case it looks like there is only one log file, on other 
>>> machines it may be more.
>>>
>>> Sunitha Kambhampati (JIRA) wrote:
>>>
>>>> Encrypted run of stress.multi test failed once with a boot error 
>>>> with ibm142
>>>> ---------------------------------------------------------------------------- 
>>>>
>>>>
>>>>          Key: DERBY-241
>>>>          URL: http://issues.apache.org/jira/browse/DERBY-241
>>>>      Project: Derby
>>>>         Type: Bug
>>>>   Components: Store      Versions: 10.1.0.0     Environment: 
>>>> ibm142, machine is a dell, 1cpu, 256MB RAM, ~497Mhz, has an IDE 
>>>> disk and has write cache enabled.
>>>>     Reporter: Sunitha Kambhampati
>>>>
>>>>
>>>> The stress.multi test failed for encryption run with ibm142 on  the 
>>>> following kind of machine once when running derbyall suite but have 
>>>> not been able to reproduce it since then.
>>>> The machine on which it failed is a  - dell, 1cpu, 256MB RAM, 
>>>> ~497Mhz, has an IDE disk and has write cache enabled. As far as I 
>>>> can tell, the machine was up and running ok when the tests were 
>>>> running.
>>>>
>>>> Looking at the test directory for the stress.multi test, the 
>>>> derby.log seems to have a lot of interrupts and looking at the 
>>>> errors shows the following boot error.
>>>>  Booting Derby version The Apache Software Foundation - Apache 
>>>> Derby - 10.1.0.0 alpha - (31132): instance 
>>>> c013800d-0103-64b3-44ec-ffffa1f4cf33
>>>> on database directory 
>>>> E:\classtest\JarResults.2005-04-20\ibm142_derbyall\derbyall\encryptionAll\encryption\multi\stress\mydb 
>>>> ERROR XSLA7: Cannot redo operation Page Operation: 
>>>> Page(5,Container(0, 384)) pageVersion 3 : Insert :  Slot=2 
>>>> recordId=8 in the log.
>>>>
>>>> Here are some of my notes in trying to debug this:
>>>> 0) Copied the problematic database to a safe location and used sane 
>>>> jars for debugging.
>>>> 1) Tried to boot the database using ij , and with the following 
>>>> debug property set  - derby.debug.true=DumpLogOnly, this dumped all 
>>>> the log records into derby.log.  Then searching for log records for 
>>>> the container(0,384) - found only 3 log records pertaining to it. 
>>>> there is one for create container and 2 records for insert. Space 
>>>> Operation for create container ( 0,384)
>>>> Page operation for (Page 5, Container(0,384)), version 3 , 
>>>> involving an insert at slot 2, record 8. Page operation for  
>>>> version 4, involving insert at slot 3, record 9. => There were no 
>>>> initPage operation for this page or any records pertaining to 
>>>> pageversion 1,2.   This means that log records were missing, but 
>>>> the only case this would be ok  is if it was a system catalog 
>>>> table.  Since in case of create database, we flush the data pages 
>>>> to disk itself,  so no logs in this case  is OK.
>>>> 2)Next step - tried to verify if it was a system catalog table. 
>>>> Looking in the org.apache.derby.impl.store.access.RAMAccessManager, 
>>>> getNextConglomId(),  the container key - 384 maps to 18th id.   One 
>>>> way I verified it was I created another empty database and saw if 
>>>> this table existed c180.dat and it did.. which is right that is a 
>>>> system catalog table.
>>>> 3) To find the actual cause of the redo exception, I put in 
>>>> printstack traces in the code, and putting in the debugger  -  the 
>>>> error printed was
>>>> ERROR XSDB1: Unknown page format at page Page(5,Container(0, 384))
>>>> It seemed like the page format was messed up.  I put printlns to 
>>>> get hte page format id ( in CachedPage, setIdentity)  and tried to 
>>>> dump the contents of the page. The checksum validation actually 
>>>> would have happened if all was ok with the format id but since here 
>>>> the format id was messed up, this error is thrown instead of a 
>>>> checksum error.
>>>> 4) There is a od facility in MKS that dumps the contents in hex and 
>>>> character format.  This table mapped to the 18th id, and that is 
>>>> the c180.dat in seg0 directory.  Doing a dump  od -c c180.dat shows 
>>>> stuff like this :  S   Y   S   C   S   _   B   A   C
>>>> 0000034040     K   U   P   _   D   A   T   A   B   A   S   E   _   
>>>> A   N   D
>>>> 0000034060     _   E   N   A   B   L   E   _   L   O   G   _   A   
>>>> R   C   H
>>>> 0000034100     I   V   E   _   M   O   D   E  \
>>>>
>>>> These seem to be system catalog procedure names, and it seems weird 
>>>> that it would not be encrypted.     Need to verify if system 
>>>> catalogs are encrypted, if so then this probably is a interrupt 
>>>> problem with encryption.
>>>>
>>>>
>>>>
>>>>  
>>>>
>>>
>>>
>>
>>
>>
>
>