whirr-dev mailing list archives

From "Paul Baclace (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (WHIRR-378) Auth fail when creating a cluster from an EC2 instance
Date Mon, 19 Sep 2011 23:57:08 GMT

    [ https://issues.apache.org/jira/browse/WHIRR-378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108253#comment-13108253 ]

Paul Baclace commented on WHIRR-378:

I see this issue too (in 0.6.0), as far as I can tell from the description, but the upshot
is that some nodes are deleted as dead on arrival (DOA) and replacement nodes are allocated,
so the cluster is eventually created successfully.  BUT I am charged for 1 hour of time on
each apparently DOA node.

In one run I found that 2 out of 5 nodes were seemingly dead on arrival (I have many examples
from the same day).  That is a high failure rate, so I wonder whether these were false-positive
DOAs.  A summary of the trimmed whirr.log is below (nodes identified by the last 3 hex digits
of the instance id):

1. starting 3 instances/nodes (fbe, fc0, fc2) at 3:37:19
2. problem with a node (fc2) at 3:38:46 or 87 sec. after node start
3. starting a new instance/node (01c) at 3:40:14
4. problem with another node (01c) at 3:41:19, or 65 sec. after node start
5. start a new instance/node (040) at 3:41:22
6. delete nodes (01c, fc2) at 3:44:34
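
The elapsed times above are plain timestamp subtraction; as a sanity check, here is a throwaway sketch (not part of Whirr; `elapsed_seconds` is my own helper) run against the log timestamps:

```python
from datetime import datetime

# Throwaway helper (not part of Whirr): seconds between two whirr.log
# timestamps in HH:MM:SS,mmm format, rounded to the nearest second.
def elapsed_seconds(start: str, end: str) -> int:
    fmt = "%H:%M:%S,%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return round(delta.total_seconds())

print(elapsed_seconds("03:37:19,332", "03:38:46,153"))  # node fc2: 87
print(elapsed_seconds("03:40:14,460", "03:41:19,691"))  # node 01c: 65
```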

The most common caused-by ssh error is "net.schmizz.sshj.userauth.UserAuthException: publickey
auth failed".

It looks like the overall error "problem applying options to node" occurs about 10 seconds
after the socket is opened, so the node is alive to some extent and this does not appear to be
an ssh timeout.  That it happens about 1 minute after instance start makes me think there
could be an implicit timer awaiting boot-up.  (These instances all use the same private
ami from instance-store and no EBS volumes.)
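
If the implicit-timer guess is right, the fix would be to keep retrying the connectivity probe until a configurable deadline rather than giving up on a fixed ~60-second schedule. A minimal sketch of that idea (plain Python, not jclouds code; `try_ssh` is a hypothetical stand-in for whatever probe the provisioner uses):

```python
import time

# Hypothetical sketch: retry an ssh probe until a deadline instead of
# declaring the node DOA after a fixed ~60s boot window.  try_ssh is a
# stand-in for whatever connectivity check the provisioner performs.
def wait_for_ssh(try_ssh, deadline_s=300.0, interval_s=10.0):
    start = time.monotonic()
    while True:
        if try_ssh():
            return True                    # node came up within the deadline
        if time.monotonic() - start >= deadline_s:
            return False                   # genuinely dead on arrival
        time.sleep(interval_s)
```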

The failed nodes appear to be deleted only after sufficient replacement nodes have started,
not when they are determined to have failed.  Looking at billing records, I noticed that I
*am* being charged for these failed nodes, so I think this is an important bug to fix.
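
To put a rough number on the billing point (back-of-the-envelope only; EC2 billed per started instance-hour at the time, and the start/delete times are copied from the log excerpt):

```python
# Back-of-the-envelope cost of the two DOA nodes.  EC2 billing at the
# time rounded up to whole instance-hours, so a node that ran only a few
# minutes before deletion is still billed for a full hour.
run_minutes = {
    "i-85914fc2": (3 * 60 + 44) - (3 * 60 + 37),  # started 03:37, deleted 03:44
    "i-5b8e501c": (3 * 60 + 44) - (3 * 60 + 40),  # started 03:40, deleted 03:44
}
billed_hours = len(run_minutes)  # one full instance-hour per DOA node
print(run_minutes, billed_hours)
```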

-----whirr.log excerpt-------
03:37:19,043 DEBUG [jclouds.compute]  << started instances([region=us-west-1, name=i-f9914fbe])
03:37:19,133 DEBUG [jclouds.compute]  << present instances([region=us-west-1, name=i-f9914fbe])
03:37:19,332 DEBUG [jclouds.compute]  << started instances([region=us-west-1, name=i-87914fc0],[region=us-west-1,
03:37:19,495 DEBUG [jclouds.compute]  << present instances([region=us-west-1, name=i-87914fc0],[region=us-west-1,

03:38:46,153 ERROR [jclouds.compute]  << problem applying options to node(us-west-1/i-85914fc2)

03:40:14,460 DEBUG [jclouds.compute]  << started instances([region=us-west-1, name=i-5b8e501c])
03:40:14,547 DEBUG [jclouds.compute]  << present instances([region=us-west-1, name=i-5b8e501c])

03:41:19,691 ERROR [jclouds.compute]  << problem applying options to node(us-west-1/i-5b8e501c)

03:41:22,738 DEBUG [jclouds.compute]  << started instances([region=us-west-1, name=i-078e5040])
03:41:22,831 DEBUG [jclouds.compute]  << present instances([region=us-west-1, name=i-078e5040])
03:44:34,257 INFO  [org.apache.whirr.actions.BootstrapClusterAction]  Deleting failed node node us-west-1/i-5b8e501c
03:44:34,259 INFO  [org.apache.whirr.actions.BootstrapClusterAction]  Deleting failed node node us-west-1/i-85914fc2
03:46:27,948 INFO  [org.apache.whirr.service.FileClusterStateStore] (main) Wrote instances file instances

The instances file ends up containing:   i-f9914fbe i-87914fc0 i-078e5040
And not containing: i-5b8e501c  i-85914fc2
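
A quick consistency check on that outcome (throwaway sketch; the instance ids are copied from the log above):

```python
# The instances file should list only surviving nodes; the two nodes
# deleted as failed must not appear in it.
kept = {"i-f9914fbe", "i-87914fc0", "i-078e5040"}
deleted = {"i-5b8e501c", "i-85914fc2"}
assert kept.isdisjoint(deleted)
assert len(kept) == 3 and len(deleted) == 2
print("instances file consistent")
```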

> Auth fail when creating a cluster from an EC2 instance
> ------------------------------------------------------
>                 Key: WHIRR-378
>                 URL: https://issues.apache.org/jira/browse/WHIRR-378
>             Project: Whirr
>          Issue Type: Bug
>          Components: service/hadoop
>    Affects Versions: 0.6.0
>            Reporter: Marc de Palol
> There is an ssh auth problem when creating a hadoop cluster from an EC2 ubuntu instance.

> I've been using the same configuration file on an EC2 computer and a physical one; everything
works fine on the physical one, but I keep getting this error on EC2:
> Running configuration script on nodes: [us-east-1/i-c7fde5a6, us-east-1/i-c9fde5a8, us-east-1/i-cbfde5aa]
> <<authenticated>> woke to: net.schmizz.sshj.userauth.UserAuthException: publickey auth failed
> <<authenticated>> woke to: net.schmizz.sshj.userauth.UserAuthException: publickey auth failed
> The user in the virtual machine is new and with valid .ssh keys.
> The hadoop config file is (omitting commented lines): 
> whirr.cluster-name=hadoop
> whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
> whirr.provider=aws-ec2
> whirr.identity=****
> whirr.credential=****
> whirr.hardware-id=c1.xlarge
> whirr.image-id=us-east-1/ami-da0cf8b3
> whirr.location-id=us-east-1

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

