whirr-dev mailing list archives

From "Tibor Kiss (JIRA)" <j...@apache.org>
Subject [jira] Updated: (WHIRR-167) Improve bootstrapping and configuration to be able to isolate and repair or evict failing nodes on EC2
Date Wed, 02 Feb 2011 13:45:29 GMT

     [ https://issues.apache.org/jira/browse/WHIRR-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tibor Kiss updated WHIRR-167:

    Attachment: whirr-167-5.patch

I rebased onto trunk and regenerated the patch against the current trunk.

One caveat: I ran the integration test for a Hadoop cluster on EC2, but it fails with
the same error that Andrei reported in https://issues.apache.org/jira/browse/WHIRR-55?focusedCommentId=12978728&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12978728

It seems that there is another blocking issue to resolve before we can rerun the integration tests.

Anyway, here is the new patch.

> Improve bootstrapping and configuration to be able to isolate and repair or evict failing
nodes on EC2
> ------------------------------------------------------------------------------------------------------
>                 Key: WHIRR-167
>                 URL: https://issues.apache.org/jira/browse/WHIRR-167
>             Project: Whirr
>          Issue Type: Improvement
>         Environment: Amazon EC2
>            Reporter: Tibor Kiss
>            Assignee: Tibor Kiss
>         Attachments: whirr-167-1.patch, whirr-167-2.patch, whirr-167-3.patch, whirr-167-4.patch,
whirr-167-5.patch, whirr-integrationtest.tar.gz, whirr.log
> The cluster startup process on Amazon EC2 instances is currently very unstable. As
the number of nodes to be started increases, the startup process fails more often,
but sometimes even a 2-3 node startup fails. We don't know how many instance
startups are in progress at the same time on the Amazon side when a startup fails or
succeeds. The only thing I see is that when I start around 10 nodes, the rate
of failing nodes is higher than with a smaller number of nodes, and it is not directly proportional
to the number of nodes; the probability that some nodes fail looks exponentially higher.
> Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for RunNodesException
and don't bail out if only a few " which indicates that the current startup process is unreliable.
So we should improve it.
> We could add a "max percent failure" property (per instance template), so that if the
number of failures exceeds this value, the whole cluster fails to launch and is shut down. For
the master node the value would be 100%, but for datanodes it would be more like 75%. (Tom
White also mentioned this in an email.)
> Let's discuss whether there are any other requirements for this improvement.
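
To make the proposed "max percent failure" check concrete, here is a minimal sketch in Java. It is not taken from the patch or from Whirr's code: the class name FailureThreshold, the method exceedsMaxFailure, and its parameters are hypothetical. The sketch reads the property literally, as the largest share of failed nodes that one instance template tolerates, and the threshold values in main are illustrative only.

```java
// Hypothetical sketch of a per-template "max percent failure" check.
// None of these names come from Whirr or jclouds.
final class FailureThreshold {

    private FailureThreshold() {
    }

    /**
     * Returns true when the share of failed nodes in one instance template
     * is strictly greater than the tolerated maximum, meaning the whole
     * cluster launch should be aborted and shut down.
     *
     * @param requested         number of nodes requested for the template (> 0)
     * @param failed            number of nodes that failed to start (0..requested)
     * @param maxPercentFailure tolerated failure share in percent (0..100)
     */
    static boolean exceedsMaxFailure(int requested, int failed, int maxPercentFailure) {
        if (requested <= 0) {
            throw new IllegalArgumentException("requested must be positive");
        }
        if (failed < 0 || failed > requested) {
            throw new IllegalArgumentException("failed must be between 0 and requested");
        }
        if (maxPercentFailure < 0 || maxPercentFailure > 100) {
            throw new IllegalArgumentException("maxPercentFailure must be between 0 and 100");
        }
        // Integer cross-multiplication avoids floating-point rounding:
        // failed / requested > maxPercentFailure / 100
        return (long) failed * 100 > (long) maxPercentFailure * requested;
    }

    public static void main(String[] args) {
        // 10 nodes requested, 2 failed, up to 25% may fail: 20% is tolerated.
        System.out.println(exceedsMaxFailure(10, 2, 25));
        // 10 nodes requested, 3 failed: 30% is above 25%, so abort.
        System.out.println(exceedsMaxFailure(10, 3, 25));
        // A single node with zero tolerance: any failure aborts the launch.
        System.out.println(exceedsMaxFailure(1, 1, 0));
    }
}
```

A check of this shape would run once per instance template after the start requests return, so that a few failed nodes in a large template need not bring down the whole cluster while a failure in a zero-tolerance template still does.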

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira