incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: repair never completes with "finished successfully"
Date Wed, 13 Apr 2011 01:54:47 GMT
Ah, unreadable rows and in the validation compaction no less. Makes a little more sense now.


Can anyone help with the EOF when deserializing columns? Is the fix to run scrub, or to drop the sstable?

Here's a theory (roughly sketched in the code after this list). AES is trying to:

1) Create TreeRequests that specify a range we want to validate.
2) Send the TreeRequests to the local node and a neighbour.
3) Process each TreeRequest by running a validation compaction (CompactionManager.doValidationCompaction in your previous stacks).
4) When both TreeRequests have returned, work out the differences and then stream data if needed.
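To make that flow concrete, here is a rough Java sketch of it. This is not the real AntiEntropyService code; every class and method name below is invented for illustration only.

// Illustrative only: NOT the real AntiEntropyService API, all names are made up.
import java.util.*;
import java.util.concurrent.*;

public class RepairFlowSketch {

    // Step 3 stand-in: a "validation compaction" that hashes what it can read
    // for the requested range and returns a (very fake) Merkle tree.
    static Map<String, Integer> buildTree(String endpoint, String range) {
        System.out.println("validating " + range + " on " + endpoint);
        return Map.of(range, range.hashCode()); // real trees hash row contents
    }

    public static void main(String[] args) throws Exception {
        String range = "(0,100]";
        List<String> endpoints = List.of("local", "neighbour"); // steps 1 and 2

        ExecutorService pool = Executors.newFixedThreadPool(endpoints.size());
        Map<String, Future<Map<String, Integer>>> requests = new HashMap<>();
        for (String ep : endpoints)                             // one TreeRequest each
            requests.put(ep, pool.submit(() -> buildTree(ep, range)));

        // Step 4: only when BOTH trees are back can we diff them and decide to stream.
        Map<String, Integer> local  = requests.get("local").get();
        Map<String, Integer> remote = requests.get("neighbour").get();
        System.out.println(local.equals(remote)
                ? "ranges consistent, nothing to stream"
                : "ranges differ, streaming needed");
        pool.shutdown();
    }
}

If step 3 never finishes on one endpoint, the corresponding future never completes and the comparison in step 4 never runs, which would look exactly like a repair that never reports success.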


Perhaps step 3 is not completing because of errors like http://www.mail-archive.com/user@cassandra.apache.org/msg12196.html
If the row is spread over multiple sstables we can skip the row in one sstable. However, if it's in a single sstable, PrecompactedRow will raise an IOError if there is a problem. This is not what is in the linked error stack, which shows a row being skipped; it's just a hunch we could check out.
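Just to illustrate the distinction (a sketch of the idea, not Cassandra's actual compaction code): when the row has copies in several sstables an unreadable copy can be skipped, but when it only exists in one sstable there is nothing to fall back on and the error can kill the whole validation.

// Sketch only, not Cassandra's PrecompactedRow/compaction code.
import java.io.IOError;
import java.io.IOException;
import java.util.List;

public class ValidationRowSketch {

    // Pretend each String is one sstable's copy of a row; "corrupt" stands in
    // for the EOF seen when deserializing columns.
    static void readCopy(String copy) throws IOException {
        if (copy.equals("corrupt"))
            throw new IOException("EOF while deserializing columns");
    }

    static void validateRow(List<String> copies) {
        if (copies.size() > 1) {
            // Row spans several sstables: log and skip the unreadable copy.
            for (String copy : copies) {
                try { readCopy(copy); }
                catch (IOException e) { System.out.println("skipping bad copy: " + e.getMessage()); }
            }
        } else {
            // Row lives in a single sstable: the error propagates and can
            // abort the validation compaction.
            try { readCopy(copies.get(0)); }
            catch (IOException e) { throw new IOError(e); }
        }
    }

    public static void main(String[] args) {
        validateRow(List.of("ok", "corrupt")); // skipped, validation carries on
        validateRow(List.of("corrupt"));       // throws IOError, validation dies
    }
}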

Do you see any IOErrors (not exceptions) in the logs, or exceptions with doValidationCompaction in the stack?

For a tree request, on the node you start the repair on you should see these log messages:
1) "Waiting for repair requests..."
2) One of "Stored local tree" or "Stored remote tree" (at DEBUG level), depending on which returns first
3) "Queuing comparison"

If we do not see the 3rd message, then we did not get a reply from either the local or the remote node.
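If you want a quick way to check for those messages, something like the snippet below would do it. The log path is only an assumption, so point it at wherever your system.log actually lives, and remember the "Stored ... tree" messages are only written at DEBUG.

// Throwaway log check; the default path below is just a guess.
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class RepairLogCheck {
    public static void main(String[] args) throws IOException {
        // Pass your own log path as the first argument if it differs.
        Path log = Paths.get(args.length > 0 ? args[0] : "/var/log/cassandra/system.log");
        List<String> markers = List.of(
                "Waiting for repair requests",  // 1) requests sent out
                "Stored local tree",            // 2) local validation returned (DEBUG)
                "Stored remote tree",           // 2) remote validation returned (DEBUG)
                "Queuing comparison");          // 3) both trees back, diff queued
        List<String> lines = Files.readAllLines(log);
        for (String marker : markers) {
            boolean seen = lines.stream().anyMatch(l -> l.contains(marker));
            System.out.println((seen ? "FOUND   " : "missing ") + marker);
        }
    }
}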

Aaron

On 13 Apr 2011, at 00:57, Jonathan Colby wrote:

> There is no "Repair session" message either.   It just starts with a message like:
> 
> INFO [manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723] 2011-04-10 14:00:59,051 AntiEntropyService.java
> (line 770) Waiting for repair requests: [#<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723,
> /10.46.108.101, (DFS,main)>, #<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723,
> /10.47.108.100, (DFS,main)>, #<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723,
> /10.47.108.102, (DFS,main)>, #<TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723,
> /10.47.108.101, (DFS,main)>]
> 
> NETSTATS:
> 
> Mode: Normal
> Not sending any streams.
> Not receiving any streams.
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0         150846
> Responses                       n/a         0         443183
> 
> One node in our cluster still has "unreadable rows", where the reads trip up every time for certain sstables (you've probably seen my earlier threads regarding that). My suspicion is that the bloom filter read on the node with the corrupt sstables is never reporting back to the repair, thereby causing it to hang.
> 
> 
> What would be great is a scrub tool that ignores unreadable/unserializable rows!  : )
> 
> 
> On Apr 12, 2011, at 2:15 PM, aaron morton wrote:
> 
>> Do you see a message starting "Repair session" and ending with "completed successfully"?
>> 
>> Or do you see any streaming activity using "nodetool netstats"
>> 
>> Repair can hang if a neighbour dies and fails to send a requested stream. It will time out after 24 hours (I think).
>> 
>> Aaron
>> 
>> On 12 Apr 2011, at 23:39, Karl Hiramoto wrote:
>> 
>>> On 12/04/2011 13:31, Jonathan Colby wrote:
>>>> There are a few other threads related to problems with nodetool repair in 0.7.4. However, I'm not seeing any errors; I just never get a message that the repair completed successfully.
>>>> 
>>>> In both my production and test clusters (with just a few MB of data), the nodetool repair prompt never returns, and the last entry in the cassandra.log is always something like:
>>>> 
>>>> #<TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9, /10.46.108.102, (DFS,main)> completed successfully: 1 outstanding
>>>> 
>>>> But I don't see a message, even hours later, that the 1 outstanding request "finished successfully".
>>>> 
>>>> Has anyone else experienced this? These are physical server nodes in local data centers, not EC2.
>>>> 
>>> 
>>> I've seen this. To fix it, try a "nodetool compact" and then the repair.
>>> 
>>> 
>>> --
>>> Karl
>> 
> 

