db-derby-user mailing list archives

From Stanley Bradbury <Stan.Bradb...@gmail.com>
Subject Re: ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313))
Date Mon, 31 Mar 2008 22:51:03 GMT
David Sitsky wrote:
> Hi Narayanan,
>
> Yes I have seen those links already.  I have spent quite a bit of time 
> confirming that my hardware is not at fault before posting here.
>
> I think you'll agree that seeing exactly the same page number fail on 3 
> separate machines points more to a software issue than a 
> hardware one.
>
> The OS has not reported any disk issues at all.
>
> Cheers,
> David
>
> Narayanan wrote:
>> Hi David,
>>
>> You might find the following links, which contain earlier discussions 
>> of a similar issue, useful:
>>
>> http://www.nabble.com/invalid-checksum-tt9528741.html#a9528741
>>
>> http://www.nabble.com/Derby-crash-%28urgent%29-tt16217446.html#a16265491
>>
>> https://issues.apache.org/jira/browse/DERBY-2475
>>
>> Narayanan
>>
>> David Sitsky wrote:
>>> I have an intensive data-processing application which utilises 
>>> Apache Lucene and Derby, using 6 quad-core machines running Vista 
>>> SP1 and/or Vista Server 2008.
>>>
>>> I have found after 5 or 10 hours of processing, one or a couple of 
>>> my worker processes start reporting the following error in the 
>>> derby.log file:
>>>
>>> ERROR XSDG2: Invalid checksum on Page Page(0,Container(0, 1313))
>>>
>>> The worker process never seems to recover.  Derby locates the error, 
>>> reboots the database, but seems to inevitably report the same error 
>>> again.  It is always page 1313, and what is extra strange is it 
>>> doesn't matter which machine it occurs on, it is always page 1313!  
>>> I know 13 is unlucky, but twice in a row must be extra unlucky. :)
>>>
>>> The quad-core machines have been configured with both hardware and 
>>> software raid, but the same error has been seen.  Windows does not 
>>> report any disk errors in the event log.
>>>
>>> The error is difficult to reproduce.  My runs typically run for 24 
>>> hours, involving 22 separate JVM processes spread across the 
>>> machines, each running their own Derby embedded database.  Sometimes 
>>> I can get through the run without any issues - sometimes I might see 
>>> one or two processes with this issue, and it seems to pick a 
>>> different quad-core machine each time, so the possibility of a 
>>> hardware error seems unlikely, especially given it is always 
>>> page 1313.
>>>
>>> I have tried both 10.3.1.4 and 10.3.2.1 with the same results.
>>>
>>> Lucene doesn't report any problems with its index, so given all the 
>>> above evidence, I am starting to lean more to a software issue than 
>>> hardware.
>>>
>>> I have attached three derby.log files from different machines.  Does 
>>> anyone have any ideas what might be causing this?
>>>
>
>
Hi Dave -

The problem is happening on Page 0 (the first page) of conglomerate 
1313 (the conglomerate number) - you can see which table or index this 
corresponds to with the following query:

 select CONGLOMERATENUMBER, CONGLOMERATENAME
  from sys.sysconglomerates
  where conglomeratenumber =  1313;
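
As a possible extension of the query above, the sketch below joins to 
SYS.SYSTABLES and SYS.SYSSCHEMAS so the owning table and schema are 
shown directly, along with whether the conglomerate is an index.  It 
uses only the standard Derby system catalogs; the aliases c, t and s 
are just illustrative names.

```sql
-- Sketch: resolve conglomerate number 1313 to its owning schema and table,
-- and report whether the conglomerate is an index (ISINDEX) or the base table.
SELECT s.SCHEMANAME, t.TABLENAME, c.CONGLOMERATENAME, c.ISINDEX
  FROM SYS.SYSCONGLOMERATES c
  JOIN SYS.SYSTABLES t ON c.TABLEID = t.TABLEID
  JOIN SYS.SYSSCHEMAS s ON t.SCHEMAID = s.SCHEMAID
 WHERE c.CONGLOMERATENUMBER = 1313;
```

Since each worker has its own embedded database, the query would need 
to be run against the database that actually reported the error.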

From the errors in your log I suspect this is the table, or one of 
the indexes, used by the failing statement listed:
    INSERT INTO text_table (guidhigh, guid, data)

Does your database reboot without any errors after the shutdown/crash?

You wrote:  "Derby locates the error, reboots the database, but seems 
to inevitably report the same error again."

I assume this means you get the same exception when the database reboots. 
That would indicate that the change was written to the transaction log 
and that the error recurs when the log is replayed during recovery at 
reboot.

From looking at your derby.log files I noted two things that may not be 
important but, because you are having trouble, you might try changing them 
to see if it makes any difference:

1) the databases are created in the directory D:\temp...   If you 
suspect that this directory is ever purged/cleared then this is not a 
good place to have database files.  The files should be in a permanent 
and secure location.

2) the attempts to reboot the database after it is shut down are almost 
instantaneous (well under a second between the SHUTDOWN timestamp 
and the BOOTING timestamp).  I assume the database is being rebooted 
within the same JVM, and I would like to see whether adding a pause 
between reboots, to ensure that all objects have been garbage collected, 
changes anything.   Alternatively, if you shut down the system, then 
attempt to access the database from a newly started JVM (I would use IJ), 
and still get the exception, this shows that uncollected Derby objects 
are not the problem.
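
For the fresh-JVM test, statements along these lines could be run from 
IJ once you have connected to the affected database.  This is only a 
sketch: it assumes the table is text_table (taken from the failing 
INSERT in your logs) and that it lives in the default APP schema, which 
may not match your setup.

```sql
-- Run from ij in a newly started JVM, after connecting to the affected database.
-- Assumption: the table is in the default APP schema; adjust the schema name if not.

-- Reads from the table; if the damaged page is touched, the XSDG2 error
-- should show up here as well. Derby may satisfy this from an index instead.
SELECT COUNT(*) FROM text_table;

-- Derby's built-in consistency check of the table and its indexes.
-- It raises an error if it finds an inconsistency.
VALUES SYSCS_UTIL.SYSCS_CHECK_TABLE('APP', 'TEXT_TABLE');
```

If both statements complete cleanly in the new JVM, that would suggest 
the on-disk data is intact and point back toward state held in the 
original process.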

Hope this helps. 







