Is the following scenario covered by 2388? I have a test cluster of 6 nodes with a replication factor of 3. Each server can execute Hadoop tasks. One Cassandra node (node 1) is down for the test.
The job is kicked off from the jobtracker on node 1.
A task is executed on node 1 and fails because the local Cassandra instance is down.
Retry on node 6: this tries to connect to node 1 and fails.
Retry on node 5: this tries to connect to node 1 and fails.
Retry on node 4: this tries to connect to node 1 and fails.
After 4 failures the task is killed and the job fails.
Nodes 2 and 3, which hold the other replicas, never run the task. The node selection seems to be random. I can modify the Cassandra code to check connectivity in ColumnFamilyRecordReader, but I suspect this is fixing the wrong problem.
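To make the connectivity-check idea concrete, here is a rough sketch of what I have in mind. It is not Cassandra code: ReplicaPicker, firstReachable and tcpProbe are hypothetical names, and the replica list is assumed to be the set of hosts that hold the split's data. The idea is simply to walk the replicas in order and use the first one that accepts a connection, instead of always connecting to the first entry.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical helper, for illustration only (not part of Cassandra or Hadoop).
class ReplicaPicker {

    // Returns the first replica for which the probe succeeds, or empty if none do.
    static Optional<String> firstReachable(List<String> replicas, Predicate<String> isUp) {
        return replicas.stream().filter(isUp).findFirst();
    }

    // Probe that treats a host as up if a TCP connection to the given port
    // succeeds within timeoutMs. The port to use is an assumption of the caller.
    static Predicate<String> tcpProbe(int port, int timeoutMs) {
        return host -> {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), timeoutMs);
                return true;
            } catch (IOException e) {
                return false; // refused, timed out, or unresolvable host
            }
        };
    }

    public static void main(String[] args) {
        // Replicas for the split in my test; node1 is the one that is down.
        List<String> replicas = Arrays.asList("node1", "node2", "node3");

        // Stand-in probe so the example runs without a cluster.
        Predicate<String> node1Down = host -> !host.equals("node1");

        System.out.println(firstReachable(replicas, node1Down).orElse("none"));
    }
}
```

Against a real cluster the stand-in probe would be replaced by something like tcpProbe(9160, 2000), assuming the default Thrift port. Even so, this only works around the symptom inside the record reader; it does not explain why the task is never scheduled on nodes 2 or 3.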
Is there a reason that Hadoop cannot select the appropriate node? Is it a configuration problem?
The thread at http://mail-archives.apache.org/mod_mbox/cassandra-user/201108.mbox/%3CCALdd-zhMWx5VKfn2EJx8pwOdp-0PNwqMrvHmeeT=5tHt+uXxSw@mail.gmail.com%3E seems to imply that the scenario will fail, but this comment from mck seems to say it should work.