db-derby-user mailing list archives

From David Sitsky <s...@nuix.com>
Subject Re: ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313))
Date Mon, 31 Mar 2008 22:20:51 GMT
For what it's worth, I did another run last night on my 6 quad-core 
system.  This time I had the derby issue happen for a JVM process on 
machine 1, two processes on machine 4, and one on machine 5.  I run four 
JVM processes per quad-core machine.

All the JVM processes have roughly the same data processing rate, but 
the issue happens at different times into the load.  The problem 
occurred around time 420, 480, 800 and 900 minutes into the load for the 
four problematic processes.

As always, all of them report the issue on page 1313, and there are no 
disk errors reported by Windows.
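
To compare the failing identifiers across derby.log files mechanically, the numbers can be pulled out of the logged message. The following is a minimal sketch, not Derby code: the class name ChecksumErrorParser is hypothetical, and the field layout Page(pageNumber,Container(segmentId, containerId)) is an assumption about how Derby prints its page key, not something confirmed in this thread. If that assumption holds, 1313 would be the container id and 0 the page number within it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChecksumErrorParser {
    // Assumed layout: Page(<pageNumber>,Container(<segmentId>, <containerId>))
    private static final Pattern PAGE_KEY =
        Pattern.compile("Page\\((\\d+),Container\\((\\d+), (\\d+)\\)\\)");

    // Returns {pageNumber, segmentId, containerId} under the assumed layout.
    static long[] parse(String message) {
        Matcher m = PAGE_KEY.matcher(message);
        if (!m.find()) {
            throw new IllegalArgumentException("No page key in: " + message);
        }
        return new long[] {
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3))
        };
    }

    public static void main(String[] args) {
        String line =
            "ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313))";
        long[] f = parse(line);
        System.out.println("page=" + f[0] + " segment=" + f[1]
            + " container=" + f[2]);
    }
}
```

Running this over each worker's derby.log would show whether every failure names exactly the same triple, or only the same trailing number.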

Any ideas what the problem might be?  I am happy to invest the time to 
track down the issue, but I need some guidance from the Derby gurus.

I noticed 1313 in binary is 10100100001.  Does this have special 
significance within Derby's binary tree structures?
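
The binary value above can be checked with a few lines of Java. This is only a sketch of the arithmetic (the class name NumberBits is hypothetical). It also prints the hexadecimal form, on the assumption, not confirmed in this thread, that Derby names the container files under seg0 after the hexadecimal id, in which case the file to look at would be c521.dat.

```java
public class NumberBits {
    // Binary digits of a non-negative int, without leading zeros.
    static String toBinary(int n) {
        return Integer.toBinaryString(n);
    }

    // Lowercase hexadecimal digits, without a 0x prefix.
    static String toHex(int n) {
        return Integer.toHexString(n);
    }

    public static void main(String[] args) {
        int id = 1313;
        System.out.println(toBinary(id));          // 10100100001
        System.out.println(toHex(id));             // 521
        System.out.println(Integer.bitCount(id));  // 4 bits set: 1024 + 256 + 32 + 1
    }
}
```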


David Sitsky wrote:
> Hi Narayanan,
> Yes I have seen those links already.  I have spent quite a bit of time 
> confirming that my hardware is not at fault before posting here.
> I think you'll agree that seeing exactly the same page number fail on 3 
> separate machines points more to a software issue than a hardware 
> one.
> The OS has not reported any disk issues at all.
> Cheers,
> David
> Narayanan wrote:
>> Hi David,
>> You might find the following links useful; they contain earlier 
>> discussions of similar issues:
>> http://www.nabble.com/invalid-checksum-tt9528741.html#a9528741
>> http://www.nabble.com/Derby-crash-%28urgent%29-tt16217446.html#a16265491
>> https://issues.apache.org/jira/browse/DERBY-2475
>> Narayanan
>> David Sitsky wrote:
>>> I have an intensive data-processing application which utilises Apache 
>>> Lucene and Derby, using 6 quad-core machines running Vista SP1 and/or 
>>> Vista Server 2008.
>>> I have found after 5 or 10 hours of processing, one or a couple of my 
>>> worker processes start reporting the following error in the derby.log 
>>> file:
>>> ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313))
>>> The worker process never seems to recover.  Derby locates the error, 
>>> reboots the database, but seems to inevitably report the same error 
>>> again.  It is always page 1313, and what is extra strange is it 
>>> doesn't matter which machine it occurs on, it is always page 1313!  I 
>>> know 13 is unlucky, but twice in a row must be extra unlucky. :)
>>> The quad-core machines have been configured with both hardware and 
>>> software raid, but the same error has been seen.  Windows does not 
>>> report any disk errors in the event log.
>>> The error is difficult to reproduce.  My runs typically run for 24 
>>> hours, involving 22 separate JVM processes spread across the 
>>> machines, each running their own Derby embedded database.  Sometimes 
>>> I can get through the run without any issues - sometimes I might see 
>>> one or two processes with this issue, and it seems to pick a 
>>> different quad-core machine each time, so the possibility of a 
>>> hardware error seems unlikely, especially given it is always 
>>> page 1313.
>>> I have tried both and with the same results.
>>> Lucene doesn't report any problems with its index, so given all the 
>>> above evidence, I am starting to lean more to a software issue than 
>>> hardware.
>>> I have attached three derby.log files from different machines.  Does 
>>> anyone have any ideas what might be causing this?
