Return-Path: Delivered-To: apmail-db-derby-dev-archive@www.apache.org Received: (qmail 97560 invoked from network); 27 Apr 2005 17:44:27 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (209.237.227.199) by minotaur.apache.org with SMTP; 27 Apr 2005 17:44:27 -0000 Received: (qmail 25037 invoked by uid 500); 27 Apr 2005 17:45:23 -0000 Delivered-To: apmail-db-derby-dev-archive@db.apache.org Received: (qmail 25002 invoked by uid 500); 27 Apr 2005 17:45:23 -0000 Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm Precedence: bulk list-help: list-unsubscribe: List-Post: List-Id: Reply-To: "Derby Development" Delivered-To: mailing list derby-dev@db.apache.org Received: (qmail 24986 invoked by uid 99); 27 Apr 2005 17:45:22 -0000 X-ASF-Spam-Status: No, hits=0.0 required=10.0 tests= X-Spam-Check-By: apache.org Received-SPF: neutral (hermes.apache.org: 32.97.182.142 is neither permitted nor denied by domain of suresh.thalamati@gmail.com) Received: from e2.ny.us.ibm.com (HELO e2.ny.us.ibm.com) (32.97.182.142) by apache.org (qpsmtpd/0.28) with ESMTP; Wed, 27 Apr 2005 10:45:22 -0700 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e2.ny.us.ibm.com (8.12.11/8.12.11) with ESMTP id j3RHiNcL017815 for ; Wed, 27 Apr 2005 13:44:23 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay04.pok.ibm.com (8.12.10/NCO/VER6.6) with ESMTP id j3RHiN8v058132 for ; Wed, 27 Apr 2005 13:44:23 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11/8.13.3) with ESMTP id j3RHiMgb019101 for ; Wed, 27 Apr 2005 13:44:23 -0400 Received: from [127.0.0.1] (tsuresh.svl.ibm.com [9.30.40.238]) by d01av04.pok.ibm.com (8.12.11/8.12.11) with ESMTP id j3RHiL6U019063 for ; Wed, 27 Apr 2005 13:44:22 -0400 Message-ID: <426FCF70.1020802@gmail.com> Date: Wed, 27 Apr 2005 10:44:16 -0700 From: Suresh Thalamati User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206) X-Accept-Language: en-us, en MIME-Version: 1.0 To: Derby Development Subject: Re: [jira] Created: (DERBY-241) Encrypted run of stress.multi test failed once with a boot error with ibm142 References: <986130498.1114544723587.JavaMail.jira@ajax.apache.org> <426E9EA3.6010204@sbcglobal.net> <426ED660.8010700@gmail.com> <426FAFCA.60505@sbcglobal.net> In-Reply-To: <426FAFCA.60505@sbcglobal.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked X-Spam-Rating: minotaur.apache.org 1.6.2 0/1000/N Another thing I would do is to skip the Log record that it is failing on currently through the debugger/by adding skip code. That way you can find some other container that might give better clues, ofcourse not always :-) Thanks -suresht Mike Matrigali wrote: > Any idea if the whole db is not encrypted, or just that table? This > is sort of hard to tell from the outside. A couple of not perfect > suggestions are: > > o od -c all the files and do a quick search to see if there is any plain > text visible. > o Somehow hack properties so that the db does not think it is encrypted > anymore (either through a debugger or somehow hand editing the binary > property file). > > Sunitha Kambhampati wrote: > >> Thanks Mike for your input. I have some more information on this >> issue , and it seems to point to an encryption & interrupt problem.. >> >> Here is what I did : >> Below, to be specific I am going to refer to the od -c (character >> dump ) report for the database described in jira 241 as the >> od-c_jira241_c180.dat. >> >> 1) Created an encrypted database in ij and checked the od -c >> c180.dat and the output is different from that in the >> od-c_jira241_c180.dat. The characters look encrypted, you cant make >> out any plain text characters from the dump. >> 2) Created a unencrypted database and checked the od -c c180.dat and >> that shows similar output as in the od -c_jira241_c180.dat. >> >> from 1 , 2 , it seems certain that this table should have been >> encrypted but for some reason it didnt .. maybe because of the >> interrupts. >> >> 3) To find out the system catalog table, the conglomerate number was >> 384, that mapped to c180.dat. Issuing the below query shows the >> table as >> SYSALIASES. >> ij> select tablename from sys.sysconglomerates a , sys.systables b >> where a.tableid=b.tableid and conglomeratenumber=384; >> TABLENAME >> -------------------------------------------------------------------------------------------------------------------------------- >> >> SYSALIASES >> >> 1 row selected >> ij> >> >> 4) The data in SYSALIASES does seem to match to what is in the >> od-c_jira241_c180.dat. >> 5) Regarding no log records for page 5, (0,384) with page version 1 >> etc is ok . This seems to be because create database flushes the data >> pages to disk so no log here is fine. Doubleverified by doing the >> following - I tried to do a dump of the logs on a fresh unencrypted >> database using derby and the log records for pageversion 1, 2 are >> missing in this case too. >> >> The stress test creates 2 functions, which will be 2 inserts into >> sysaliases. I created the two functions on the new unencrypted >> database and the derby.log log record dump shows 2 insert operations >> and a space operation for the page (5,(0,384)) which is inline with >> what I found earlier for the jira241 case. >> >> So now as it seems like this problem has to do with encryption, next >> step seems to be to look in code for cases to see why encryption >> didnt happen here. >> >> Sunitha. >> >> Mike Matrigali wrote: >> >>> some next steps: >>> >>> 1) figure out what catalog the problem is in. Since it is likely a >>> system catalog, creating a new database will result in the same >>> containerid being used for the same table. >>> >>> 2) interested in result of checking if system tables are encrypted, >>> if they are supposed to be then finding the clear text disk data >>> file is >>> the problem. This would point to an encryption bug, possibly JVM, >>> possibly related to interrupts. If trying to reproduce probably should >>> keep the JVM and encryption algorithm same as the failure case. >>> >>> 3) probably reasonable to create an encrypted database and do the >>> log dump, just to verify that clean db also does not have log >>> records for those first updates during database creation. Note I >>> think the amount of stuff done under no logging conditions was >>> recently increased by Dan, this may be different between the trunk >>> and the released copy. >>> >>> 4) These problems are hard to diagnose without a reproducible case, >>> If anyone else see's databases not booting fine in the stress test >>> please post the information - sometimes just getting one more case >>> makes a pattern obvious. Note that some errors reported from the >>> stress tests are just problems of using interupts for shutdown, and >>> are not recovery issues. The system gets into a state where the >>> logging system has been >>> told to shutdown so it basically fails all subsequent requests, which >>> can show up as various errors in the log. The errors I am most >>> concerned is if the subsequent sanity check of the database in the >>> stress test does not boot, there is no non-bug situation currently >>> where the system should not boot after any type of shutdown >>> (orderly, thread kills, jvm kills, or machine crash). >>> >>> As you probably saw the stress tests set the flag so that all log files >>> are saved, even those not needed for recovery. This makes it a >>> little easier to get a history from the store perspective on what >>> has happened. >>> In this case it looks like there is only one log file, on other >>> machines it may be more. >>> >>> Sunitha Kambhampati (JIRA) wrote: >>> >>>> Encrypted run of stress.multi test failed once with a boot error >>>> with ibm142 >>>> ---------------------------------------------------------------------------- >>>> >>>> >>>> Key: DERBY-241 >>>> URL: http://issues.apache.org/jira/browse/DERBY-241 >>>> Project: Derby >>>> Type: Bug >>>> Components: Store Versions: 10.1.0.0 Environment: >>>> ibm142, machine is a dell, 1cpu, 256MB RAM, ~497Mhz, has an IDE >>>> disk and has write cache enabled. >>>> Reporter: Sunitha Kambhampati >>>> >>>> >>>> The stress.multi test failed for encryption run with ibm142 on the >>>> following kind of machine once when running derbyall suite but have >>>> not been able to reproduce it since then. >>>> The machine on which it failed is a - dell, 1cpu, 256MB RAM, >>>> ~497Mhz, has an IDE disk and has write cache enabled. As far as I >>>> can tell, the machine was up and running ok when the tests were >>>> running. >>>> >>>> Looking at the test directory for the stress.multi test, the >>>> derby.log seems to have a lot of interrupts and looking at the >>>> errors shows the following boot error. >>>> Booting Derby version The Apache Software Foundation - Apache >>>> Derby - 10.1.0.0 alpha - (31132): instance >>>> c013800d-0103-64b3-44ec-ffffa1f4cf33 >>>> on database directory >>>> E:\classtest\JarResults.2005-04-20\ibm142_derbyall\derbyall\encryptionAll\encryption\multi\stress\mydb >>>> ERROR XSLA7: Cannot redo operation Page Operation: >>>> Page(5,Container(0, 384)) pageVersion 3 : Insert : Slot=2 >>>> recordId=8 in the log. >>>> >>>> Here are some of my notes in trying to debug this: >>>> 0) Copied the problematic database to a safe location and used sane >>>> jars for debugging. >>>> 1) Tried to boot the database using ij , and with the following >>>> debug property set - derby.debug.true=DumpLogOnly, this dumped all >>>> the log records into derby.log. Then searching for log records for >>>> the container(0,384) - found only 3 log records pertaining to it. >>>> there is one for create container and 2 records for insert. Space >>>> Operation for create container ( 0,384) >>>> Page operation for (Page 5, Container(0,384)), version 3 , >>>> involving an insert at slot 2, record 8. Page operation for >>>> version 4, involving insert at slot 3, record 9. => There were no >>>> initPage operation for this page or any records pertaining to >>>> pageversion 1,2. This means that log records were missing, but >>>> the only case this would be ok is if it was a system catalog >>>> table. Since in case of create database, we flush the data pages >>>> to disk itself, so no logs in this case is OK. >>>> 2)Next step - tried to verify if it was a system catalog table. >>>> Looking in the org.apache.derby.impl.store.access.RAMAccessManager, >>>> getNextConglomId(), the container key - 384 maps to 18th id. One >>>> way I verified it was I created another empty database and saw if >>>> this table existed c180.dat and it did.. which is right that is a >>>> system catalog table. >>>> 3) To find the actual cause of the redo exception, I put in >>>> printstack traces in the code, and putting in the debugger - the >>>> error printed was >>>> ERROR XSDB1: Unknown page format at page Page(5,Container(0, 384)) >>>> It seemed like the page format was messed up. I put printlns to >>>> get hte page format id ( in CachedPage, setIdentity) and tried to >>>> dump the contents of the page. The checksum validation actually >>>> would have happened if all was ok with the format id but since here >>>> the format id was messed up, this error is thrown instead of a >>>> checksum error. >>>> 4) There is a od facility in MKS that dumps the contents in hex and >>>> character format. This table mapped to the 18th id, and that is >>>> the c180.dat in seg0 directory. Doing a dump od -c c180.dat shows >>>> stuff like this : S Y S C S _ B A C >>>> 0000034040 K U P _ D A T A B A S E _ >>>> A N D >>>> 0000034060 _ E N A B L E _ L O G _ A >>>> R C H >>>> 0000034100 I V E _ M O D E \ >>>> >>>> These seem to be system catalog procedure names, and it seems weird >>>> that it would not be encrypted. Need to verify if system >>>> catalogs are encrypted, if so then this probably is a interrupt >>>> problem with encryption. >>>> >>>> >>>> >>>> >>>> >>> >>> >> >> >> > >