incubator-cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: New node unable to stream (0.8.5)
Date Thu, 15 Sep 2011 13:21:38 GMT
Where did the data loss come in?

Scrub is safe to run in parallel.
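Running scrub across several nodes at once might look like the following sketch (host names are placeholders, and the commands are echoed as a dry run rather than executed against a real cluster):

```shell
# Hypothetical hosts; substitute nodes from your own ring.
# The echo is a dry run -- run the printed command (e.g. via ssh) to execute.
for host in cass1 cass2 cass3; do
  echo "nodetool -h $host scrub" &
done
wait
```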

On Thu, Sep 15, 2011 at 8:08 AM, Ethan Rowe <ethan@the-rowes.com> wrote:
> After further review, I'm definitely going to scrub all the original nodes
> in the cluster.
> We've lost some data as a result of this situation.  It can be restored, but
> the question is what to do with the problematic new node first.  I don't
> particularly care about the data that's on it, since I'm going to re-import
> the critical data from files anyway, and then I can recreate derivative data
> afterwards.  So it's purely a matter of getting the cluster healthy again as
> quickly as possible so I can begin that import process.
> Any issue with running scrubs on multiple nodes at a time, provided they
> aren't replication neighbors?
> On Thu, Sep 15, 2011 at 8:18 AM, Ethan Rowe <ethan@the-rowes.com> wrote:
>>
>> I just noticed the following from one of Jonathan Ellis' messages
>> yesterday:
>>>
>>> Added to NEWS:
>>>
>>>    - After upgrading, run nodetool scrub against each node before running
>>>      repair, moving nodes, or adding new ones.
>>
>>
>> We did not do this, as it was not indicated as necessary in NEWS when
>> we were dealing with the upgrade.
>> So perhaps I need to scrub everything before going any further, though the
>> question is what to do with the problematic node.  Additionally, it would be
>> helpful to know if scrub will affect the hinted handoffs that have
>> accumulated, as these seem likely to be part of the set of failing streams.
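For what it's worth, the NEWS advice quoted above amounts to an ordering constraint: scrub every node first, and only then repair, move, or add nodes. A dry-run sketch (placeholder hosts, commands echoed rather than run):

```shell
hosts="cass1 cass2 cass3"
# First pass: scrub every node in the cluster.
for h in $hosts; do echo "nodetool -h $h scrub"; done
# Only after all scrubs have finished: repair (or move/add nodes).
for h in $hosts; do echo "nodetool -h $h repair"; done
```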
>> On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe <ethan@the-rowes.com> wrote:
>>>
>>> Here's a typical log slice (not terribly informative, I fear):
>>>>
>>>>  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java (line 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for (29990798416657667504332586989223299634,54296681768153272037430773234349600451]
>>>>  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line 181) Stream context metadata [/mnt/cassandra/data/events_production/FitsByShip-g-10-Data.db sections=88 progress=0/11707163 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-6-Data.db sections=1 progress=0/6918814 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db sections=260 progress=0/9091780 - 0%], 4 sstables.
>>>>  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java (line 174) Streaming to /10.34.90.8
>>>> ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-56,5,main]
>>>> java.lang.NullPointerException
>>>>         at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:174)
>>>>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:114)
>>>
>>> Not sure if the exception is related to the outbound streaming above;
>>> other nodes are actively trying to stream to this node, so perhaps it comes
>>> from those and temporal adjacency to the outbound stream is just
>>> coincidental.  I have other snippets that look basically identical to the
>>> above, except if I look at the logs to which this node is trying to stream,
>>> I see that it has concurrently opened a stream in the other direction, which
>>> could be the one that the exception pertains to.
>>>
>>> On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne <sylvain@datastax.com>
>>> wrote:
>>>>
>>>> On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe <ethan@the-rowes.com> wrote:
>>>> > Hi.
>>>> >
>>>> > We've been running a 7-node cluster with RF 3, QUORUM reads/writes in
>>>> > our
>>>> > production environment for a few months.  It's been consistently
>>>> > stable
>>>> > during this period, particularly once we got our maintenance strategy
>>>> > fully
>>>> > worked out (per node, one repair a week, one major compaction a week,
>>>> > the
>>>> > latter due to the nature of our data model and usage).  While this
>>>> > cluster
>>>> > started, back in June or so, on the 0.7 series, it's been running
>>>> > 0.8.3 for
>>>> > a while now with no issues.  We upgraded to 0.8.5 two days ago, having
>>>> > tested the upgrade in our staging cluster (with an otherwise identical
>>>> > configuration) previously and verified that our application's various
>>>> > use
>>>> > cases appeared successful.
>>>> >
>>>> > One of our nodes suffered a disk failure yesterday.  We attempted to
>>>> > replace
>>>> > the dead node by placing a new node at OldNode.initial_token - 1 with
>>>> > auto_bootstrap on.  A few things went awry from there:
>>>> >
>>>> > 1. We never saw the new node in bootstrap mode; it became available
>>>> > pretty
>>>> > much immediately upon joining the ring, and never reported a "joining"
>>>> > state.  I did verify that auto_bootstrap was on.
>>>> >
>>>> > 2. I mistakenly ran repair on the new node rather than removetoken on
>>>> > the
>>>> > old node, due to a delightful mental error.  The repair got nowhere
>>>> > fast, as
>>>> > it attempts to repair against the down node which throws an exception.
>>>> >  So I
>>>> > interrupted the repair, restarted the node to clear any pending
>>>> > validation
>>>> > compactions, and...
>>>> >
>>>> > 3. Ran removetoken for the old node.
>>>> >
>>>> > 4. We let this run for some time and saw eventually that all the nodes
>>>> > appeared to be done with various compactions and were stuck at streaming.
>>>> > Many
>>>> > streams listed as open, none making any progress.
>>>> >
>>>> > 5.  I observed an Rpc-related exception on the new node (where the
>>>> > removetoken was launched) and concluded that the streams were broken
>>>> > so the
>>>> > process wouldn't ever finish.
>>>> >
>>>> > 6. Ran a "removetoken force" to get the dead node out of the mix.  No
>>>> > problems.
>>>> >
>>>> > 7. Ran a repair on the new node.
>>>> >
>>>> > 8. Validations ran, streams opened up, and again things got stuck in
>>>> > streaming, hanging for over an hour with no progress.
>>>> >
>>>> > 9. Musing that lingering tasks from the removetoken could be a factor,
>>>> > I
>>>> > performed a rolling restart and attempted a repair again.
>>>> >
>>>> > 10. Same problem.  Did another rolling restart and attempted a fresh
>>>> > repair
>>>> > on the most important column family alone.
>>>> >
>>>> > 11. Same problem.  Streams included CFs not specified, so I guess they
>>>> > must
>>>> > be for hinted handoff.
>>>> >
>>>> > In concluding that streaming is stuck, I've observed:
>>>> > - streams will be open to the new node from other nodes, but the new
>>>> > node
>>>> > doesn't list them
>>>> > - streams will be open to the other nodes from the new node, but the
>>>> > other
>>>> > nodes don't list them
>>>> > - the streams reported may make some initial progress, but then they
>>>> > hang at
>>>> > a particular point and do not move on for an hour or more.
>>>> > - The logs report repair-related activity, until NPEs on incoming TCP
>>>> > connections show up, which appear likely to be the culprit.
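One way to see the asymmetry described above is to compare the stream listings from both ends of a transfer; in 0.8, `nodetool netstats` reports active streams, and a healthy stream should appear on both hosts. A dry-run sketch (host names are placeholders):

```shell
# Compare netstats output from the new node and one of its peers;
# the echo is a dry run -- run the printed commands to execute.
for host in new-node peer-node; do
  echo "nodetool -h $host netstats"
done
```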
>>>>
>>>> Can you send the stack trace from those NPE.
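A quick way to pull those traces out of the log, sketched here against a throwaway sample file so it runs as-is (in practice you would point grep at the node's system.log; /var/log/cassandra/system.log is a common default, but the path is an assumption):

```shell
# Build a tiny sample log so the grep below is runnable as written.
log=$(mktemp)
cat > "$log" <<'EOF'
ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-56,5,main]
java.lang.NullPointerException
        at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:174)
EOF
# Print each NPE line plus up to two lines of trailing stack frames.
grep -A 2 'NullPointerException' "$log"
rm -f "$log"
```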
>>>>
>>>> >
>>>> > I can provide more exact details when I'm done commuting.
>>>> >
>>>> > With streaming broken on this node, I'm unable to run repairs, which
>>>> > is
>>>> > obviously problematic.  The application didn't suffer any operational
>>>> > issues
>>>> > as a consequence of this, but I need to review the overnight results
>>>> > to
>>>> > verify we're not suffering data loss (I doubt we are).
>>>> >
>>>> > At this point, I'm considering a couple options:
>>>> > 1. Remove the new node and let the adjacent node take over its range
>>>> > 2. Bring the new node down, add a new one in front of it, and properly
>>>> > removetoken the problematic one.
>>>> > 3. Bring the new node down, remove all its data except for the system
>>>> > keyspace, then bring it back up and repair it.
>>>> > 4. Revert to 0.8.3 and see if that helps.
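Option 3 above might be sketched like this, with the node stopped first. The data directory path is an assumption (a common default; check cassandra.yaml), and the rm is echoed as a dry run:

```shell
data=/var/lib/cassandra/data   # assumed default; verify against cassandra.yaml
for d in "$data"/*/; do
  ks=$(basename "$d")
  [ "$ks" = "system" ] && continue   # keep the system keyspace
  echo "rm -rf $d"                   # dry run; remove the echo to delete
done
```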
>>>> >
>>>> > Recommendations?
>>>> >
>>>> > Thanks.
>>>> > - Ethan
>>>> >
>>>
>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
