cassandra-commits mailing list archives

From "sankalp kohli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6747) MessagingService should handle failures on remote nodes.
Date Fri, 04 Apr 2014 16:11:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960085#comment-13960085 ]

sankalp kohli commented on CASSANDRA-6747:
------------------------------------------

Please review v2 with your suggestions. 

> MessagingService should handle failures on remote nodes.
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6747
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: sankalp kohli
>            Priority: Minor
>              Labels: Core
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6747-v2.diff, CASSANDRA-6747.diff
>
>
> While going through the code of MessagingService, I discovered that we don't handle callbacks
> on failure very well. If a Verb Handler on the remote machine throws an exception, it goes
> straight to the uncaught exception handler. The machine which sent the message will keep
> waiting and will eventually time out. On timeout, it will do some stuff hard coded in the MS,
> like writing hints and adding to latency. There is no way in IAsyncCallback to specify what to do
> on timeouts or on failures.
> Here are some examples I found that would benefit if we enhance this system to also propagate
> failures back, so that IAsyncCallback has methods like onFailure.
> 1) From ActiveRepairService.prepareForRepair
>    IAsyncCallback callback = new IAsyncCallback()
>        {
>            @Override
>            public void response(MessageIn msg)
>            {
>                prepareLatch.countDown();
>            }
>            @Override
>            public boolean isLatencyForSnitch()
>            {
>                return false;
>            }
>        };
>        List<UUID> cfIds = new ArrayList<>(columnFamilyStores.size());
>        for (ColumnFamilyStore cfs : columnFamilyStores)
>            cfIds.add(cfs.metadata.cfId);
>        for(InetAddress neighbour : endpoints)
>        {
>            PrepareMessage message = new PrepareMessage(parentRepairSession, cfIds, ranges);
>            MessageOut<RepairMessage> msg = message.createMessage();
>            MessagingService.instance().sendRR(msg, neighbour, callback);
>        }
>        try
>        {
>            prepareLatch.await(1, TimeUnit.HOURS);
>        }
>        catch (InterruptedException e)
>        {
>            parentRepairSessions.remove(parentRepairSession);
>            throw new RuntimeException("Did not get replies from all endpoints.", e);
>        }
> 2) During the snapshot phase in repair, if SnapshotVerbHandler throws an exception, we will
> wait forever.
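
The idea in the description above, a callback that is told about remote failures instead of waiting for a timeout, can be sketched roughly as follows. This is only an illustration under stated assumptions: CallbackWithFailure, PrepareCallback and awaitAll are hypothetical names, endpoints are plain strings, and nothing here is the actual Cassandra IAsyncCallback or the change in the attached patches. The sketch also checks the boolean returned by CountDownLatch.await, which is false when the timeout elapses.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class FailureAwareCallbackSketch
{
    /** Hypothetical callback shape; NOT the real Cassandra IAsyncCallback. */
    interface CallbackWithFailure<T>
    {
        void response(T msg);

        void onFailure(String endpoint, Throwable cause);
    }

    /** Latch-based callback that stops waiting as soon as one endpoint reports a failure. */
    static final class PrepareCallback implements CallbackWithFailure<String>
    {
        private final CountDownLatch latch;
        private final AtomicReference<Throwable> failure = new AtomicReference<>();

        PrepareCallback(int expectedReplies)
        {
            latch = new CountDownLatch(expectedReplies);
        }

        @Override
        public void response(String msg)
        {
            latch.countDown();
        }

        @Override
        public void onFailure(String endpoint, Throwable cause)
        {
            // Remember the first failure, then release the waiter right away.
            failure.compareAndSet(null, new RuntimeException("Endpoint " + endpoint + " failed", cause));
            while (latch.getCount() > 0)
                latch.countDown();
        }

        /** Returns true only if every endpoint replied successfully within the timeout. */
        boolean awaitAll(long timeout, TimeUnit unit)
        {
            try
            {
                // await() returns false on timeout, so its result has to be checked.
                return latch.await(timeout, unit) && failure.get() == null;
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        Throwable failure()
        {
            return failure.get();
        }
    }

    public static void main(String[] args)
    {
        PrepareCallback callback = new PrepareCallback(3);
        callback.response("ok from node1");
        callback.onFailure("node2", new IllegalStateException("verb handler threw"));

        // Returns immediately instead of blocking for the full hour.
        boolean prepared = callback.awaitAll(1, TimeUnit.HOURS);
        System.out.println("prepared=" + prepared + ", failure=" + callback.failure());
    }
}
```

With a shape like this, the caller in example 1 could clean up and report the failing endpoint as soon as onFailure fires, rather than discovering the problem only when the latch wait expires.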



--
This message was sent by Atlassian JIRA
(v6.2#6252)
