db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mike Matrigali (JIRA)" <j...@apache.org>
Subject [jira] Updated: (DERBY-4239) corruption on z/OS with storerecovery oc_rec? tests. ERROR XSLA7: Cannot redo operation null in the log.
Date Tue, 26 May 2009 15:38:45 GMT

     [ https://issues.apache.org/jira/browse/DERBY-4239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Mike Matrigali updated DERBY-4239:
----------------------------------

    Attachment: derby_dumponly.zip

derby_dumponly.log is a derby.log booted on the database that won't boot, with the following
debug flag.  This forces the boot to just print all the log records rather than 
process them:
derby.debug.true=DumpLogOnly,LogTrace

Using this I can see what was going on in the db after the problem log record 
causing the crash.  What I see is that toward the end the is a log record for 
the compress of container 1073:
DEBUG LogTrace OUTPUT: scanned 64851 : Page Operation: Page(0,Container(0, 1073)) pageVersion
107 :  CompressSpaceOperation: newHighestPage = 13;num_p
ages_truncated = 31 to Page(0,Container(0, 1073)) instant = (6,3249863) logEnd = (6,3249908)
logIn at 25 available 4^M

If I look backward in the log from this point I find the most recent operation on 1073:
DEBUG LogTrace OUTPUT: scanned 64820 : Page Operation: Page(29,Container(0, 1024)) pageVersion
66 : Purge : 1 slots starting at 12 (recordId=18) insta
nt = (6,3210739) logEnd = (6,3210791) logIn at 23 available 13^M

And If I again start at that Compress log record and search backward for a 
checkpoint record I find:
DEBUG LogTrace OUTPUT: scanned 61634 : Checkpoint :     redoLWM (4,943)
        undoLWM (4,943)
**************************
org.apache.derby.impl.store.raw.xact.TransactionTable@16ca16ca
Transaction Table: size = 1 largestUpdateXactId = 61634
Xid=61634 gid=null firstLog=(4,943) lastLog=null transactionStatus=0 myxact=null update=true
recovery=true prepare=false needExclusion=true
--------------------------- instant = (4,981) logEnd = (4,1056) logIn at 55 available 4^M

And the last checkpoint in the log is:
DEBUG LogTrace OUTPUT: scanned 64191 : Checkpoint :     redoLWM (6,24)
        undoLWM (5,61)
**************************
org.apache.derby.impl.store.raw.xact.TransactionTable@64a864a8
Transaction Table: size = 2 largestUpdateXactId = 64852
Xid=64191 gid=null firstLog=(6,3250100) lastLog=null transactionStatus=0 myxact=null update=true
recovery=true prepare=false needExclusion=true
Xid=64852 gid=null firstLog=(6,3249980) lastLog=(6,3250018) transactionStatus=0 myxact=null
update=true recovery=true prepare=false needExclusion=true
--------------------------- instant = (6,3250138) logEnd = (6,3250245) logIn at 87 available
4^M

The problem with this is that for redo recovery of compress space record to work
properly, there must be a checkpoint with a redo lwm (low water mark) that is after any operation
on the container before the compress operation happens.
Compress calls checkpoint to make this happen.  The reason is that redo 
recovery wants to replay any log record it has making pages march orderly from
version n to version n+1, ...   But the compress space operation shrinks the file
on disk losing version n, so version n+1 can't be redone.  

In this case the last checkpoint has redo lwm of (6,24) and the last operation is
after that at (6,3210739)

> corruption on z/OS with storerecovery oc_rec? tests.  ERROR XSLA7: Cannot redo operation
null in the log.
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: DERBY-4239
>                 URL: https://issues.apache.org/jira/browse/DERBY-4239
>             Project: Derby
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 10.5.1.1
>         Environment: z/OS z10 processor. 
> java version "1.6.0"
> Java(TM) SE Runtime Environment (build pmz3160sr4-20090219_01(SR4))
> IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 z/OS s390-31 jvmmz3160-20090215_29883 (JIT
enabled, AOT enabled)
> J9VM - 20090215_029883_bHdSMr
> JIT  - r9_20090213_2028
> GC   - 20090213_AA)
> JCL  - 20090218_01
> also 
> java version "1.6.0"
> Java(TM) SE Runtime Environment (build pmz3160sr2ifix-20081021_01(SR2+IZ32776+IZ33456))
> IBM J9 VM (build 2.4, J2RE 1.6.0 IBM J9 2.4 z/OS s390-31 jvmmz3160ifx-20081010_24288
(JIT enabled, AOT enabled)
> J9VM - 20081009_024288_bHdSMr
> JIT  - r9_20080721_1330ifx2
> GC   - 20080724_AA)
> JCL  - 20080808_02
>            Reporter: Kathey Marsden
>            Assignee: Mike Matrigali
>            Priority: Critical
>         Attachments: badlogsizes.txt, derby.log, derby.log, derby_dumponly.zip, goodlogsizes.txt,
identifyBadContainer.ksh, reproBackgroundCheckpoint.zip, reproDerby4239.zip, wombat_keeplog_notcorrupt.zip,
wombat_with_keeplog.zip
>
>
> I saw corruption on z/OS with the storerecovery tests and 10.5.1.1.  The failure comes
in oc_rec3 trying to connect to the database, but the actual problem seems to have occurred
with the prior test oc_rec2.  The problem is somewhat intermittent, happening approximately
1/4 times.  I extracted the case from the harness and will attach the reproduction and run
the script repro.ksh.  The script will loop up to 50 times until it gets the failure which
looks like.
> ERROR XSLA7: Cannot redo operation null in the log.
> 	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
> 	at org.apache.derby.impl.store.raw.log.FileLogger.redo(Unknown Source)
> 	at org.apache.derby.impl.store.raw.log.LogToFile.recover(Unknown Source)
> 	at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
> 	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
> 	at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
> 	at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
> 	at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source)
> 	at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(Unknown Source)
> 	at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(Unknown
Source)
> 	at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(Unknown
Source)
> 	at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Unknown Source)
> 	at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
> 	at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
> 	at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown Source)
> 	at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
> 	at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
> 	at java.sql.DriverManager.getConnection(DriverManager.java:311)
> 	at java.sql.DriverManager.getConnection(DriverManager.java:268)
> 	at CheckTables.main(CheckTables.java:8)
> Caused by: ERROR XSDBB: Unknown page format at page Page(16,Container(0, 1073)), page
dump follows: Hex dump:
> 00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> <snip lots of 000's>
> I ran it with 10.3 and it completed all 50 iterations, so whether JVM or Derby issue
it seems new since 10.3. (I haven't tried with 10.4).  Oddly I have run tests many times before
on this machine using in the 10.5.1.1 release and the same jvm and have never seen this failure,
so am looking into whether maybe something changed on the machine or environment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message