Hi Matthieu,

I reconfigured the /etc/hosts file so that the local machine name maps to the machine's real IP address instead of 127.0.0.1.
Then it worked!
Thank you so much.

Now I have run into another problem.
I followed the steps in "Run Twitter Trending Example" and created two clusters with newCluster on the same ZooKeeper server, testing.machine1:2182.
Then I set up two nodes for cluster1 and one node for cluster2 on that same server, testing.machine1:2182.

When I deployed the twitter-counter app on cluster1, there was no problem.
When I deployed the twitter-adapter app on cluster2, it did not work.

[root@testing apache-s4-0.5.0-incubating-src]# ./s4 deploy -s4r=/usr/apache-s4-0.5.0-incubating-src/test-apps/twitter-adapter/build/libs/twitter-adapter.s4r -c=cluster2 -appName=twitter-adapter -zk=testing.machine1:2182
10:31:27.178 [main] ERROR org.apache.s4.tools.Deploy - Cannot deploy app
org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
    at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:876) ~[zkclient-0.1.jar:na]
    at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98) ~[zkclient-0.1.jar:na]
    at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:92) ~[zkclient-0.1.jar:na]
    at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:76) ~[zkclient-0.1.jar:na]
    at org.apache.s4.tools.Deploy.main(Deploy.java:59) ~[s4-tools-0.5.0-incubating.jar:0.5.0-incubating]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.6.0_22]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[na:1.6.0_22]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.6.0_22]
    at java.lang.reflect.Method.invoke(Method.java:616) ~[na:1.6.0_22]
    at org.apache.s4.tools.Tools$Task.dispatch(Tools.java:54) [s4-tools-0.5.0-incubating.jar:0.5.0-incubating]
    at org.apache.s4.tools.Tools.main(Tools.java:94) [s4-tools-0.5.0-incubating.jar:0.5.0-incubating]


Then I deployed the twitter-adapter app through another server on machine1, testing.machine1:2183, and it worked.

[root@testing apache-s4-0.5.0-incubating-src]# ./s4 deploy -s4r=/usr/apache-s4-0.5.0-incubating-src/test-apps/twitter-adapter/build/libs/twitter-adapter.s4r -c=cluster2 -appName=twitter-adapter -zk=testing.machine1:2183
10:33:00.830 [main] INFO  org.apache.s4.tools.Deploy - Using specified S4R [/usr/apache-s4-0.5.0-incubating-src/test-apps/twitter-adapter/build/libs/twitter-adapter.s4r], the S4R archive will not be built from source (and corresponding parameters are ignored)
10:33:00.911 [main] INFO  org.apache.s4.tools.Deploy - uploaded application [twitter-adapter] to cluster [cluster2], using zookeeper znode [/s4/clusters/cluster2/app/twitter-adapter], and s4r file [/usr/apache-s4-0.5.0-incubating-src/test-apps/twitter-adapter/build/libs/twitter-adapter.s4r]


Then I checked the logs of the PE node. The error is as follows.

[root@testing apache-s4-0.5.0-incubating-src]# ./s4 node -c=cluster1 -zk=testing.machine1:2182
10:36:05.165 [main] INFO  org.apache.s4.core.Main - Initializing S4 node with :
- comm module class [org.apache.s4.comm.DefaultCommModule]
- comm configuration file [default.s4.comm.properties from classpath]
- core module class [org.apache.s4.core.DefaultCoreModule]
- core configuration file[default.s4.core.properties from classpath]
- extra modules: []
- inline parameters: []
10:36:05.175 [main] DEBUG org.apache.s4.core.Main - Adding named parameters for injection : [s4.cluster.zk_address=testing.machine1:2182]
10:36:05.525 [main] INFO  org.apache.s4.core.Main - Starting S4 node. This node will automatically download applications published for the cluster it belongs to
10:36:16.745 [main] ERROR org.apache.s4.core.Main - Cannot start S4 node
com.google.inject.ProvisionException: Guice provision errors:

1) Error injecting constructor, org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
  at org.apache.s4.core.Server.<init>(Server.java:71)
  while locating org.apache.s4.core.Server

1 error
    at com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:987) ~[guice-3.0.jar:na]
    at com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1013) ~[guice-3.0.jar:na]
    at org.apache.s4.core.Main.startNode(Main.java:148) [s4-core-0.5.0-incubating.jar:0.5.0-incubating]
    at org.apache.s4.core.Main.main(Main.java:75) [s4-core-0.5.0-incubating.jar:0.5.0-incubating]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.6.0_22]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[na:1.6.0_22]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.6.0_22]
    at java.lang.reflect.Method.invoke(Method.java:616) ~[na:1.6.0_22]
    at org.apache.s4.tools.Tools$Task.dispatch(Tools.java:54) [s4-tools-0.5.0-incubating.jar:0.5.0-incubating]
    at org.apache.s4.tools.Tools.main(Tools.java:94) [s4-tools-0.5.0-incubating.jar:0.5.0-incubating]
Caused by: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
    at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:876) ~[zkclient-0.1.jar:na]
    at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98) ~[zkclient-0.1.jar:na]
    at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:92) ~[zkclient-0.1.jar:na]
    at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:80) ~[zkclient-0.1.jar:na]
    at org.apache.s4.core.Server.<init>(Server.java:74) ~[s4-core-0.5.0-incubating.jar:0.5.0-incubating]
    at org.apache.s4.core.Server$$FastClassByGuice$$69e0fd5b.newInstance(<generated>) ~[guice-3.0.jar:0.5.0-incubating]
    at com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40) ~[guice-3.0.jar:na]
    at com.google.inject.internal.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:60) ~[guice-3.0.jar:na]
    at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:85) ~[guice-3.0.jar:na]
    at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:254) ~[guice-3.0.jar:na]
    at com.google.inject.internal.InjectorImpl$4$1.call(InjectorImpl.java:978) ~[guice-3.0.jar:na]
    at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1024) ~[guice-3.0.jar:na]
    at com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:974) ~[guice-3.0.jar:na]
    ... 9 common frames omitted
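
Both stack traces above end in the same ZkTimeoutException, which only says that no ZooKeeper session could be established within 10000 ms. One way to narrow this down might be to check whether the port accepts TCP connections at all. The sketch below is not part of S4 or ZooKeeper; the class name ZkPortProbe, its methods, and the 2000 ms timeout are made up for illustration. It takes the same host:port strings that were passed to -zk.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of S4: checks plain TCP reachability only.
final class ZkPortProbe {

    /** Splits a "host:port" string (the form passed to -zk) without resolving it. */
    static InetSocketAddress parse(String hostPort) {
        int idx = hostPort.lastIndexOf(':');
        if (idx < 0) {
            throw new IllegalArgumentException("expected host:port, got " + hostPort);
        }
        String host = hostPort.substring(0, idx);
        int port = Integer.parseInt(hostPort.substring(idx + 1));
        return InetSocketAddress.createUnresolved(host, port);
    }

    /** Returns true if a TCP connection can be opened within the timeout. */
    static boolean canConnect(String hostPort, int timeoutMillis) {
        InetSocketAddress target = parse(hostPort);
        try (Socket socket = new Socket()) {
            // Resolve the host name here; resolution or connection failures end up in the catch.
            socket.connect(new InetSocketAddress(target.getHostString(), target.getPort()),
                    timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Addresses taken from the commands above; the timeout is an arbitrary choice.
        String[] servers = {"testing.machine1:2182", "testing.machine1:2183"};
        for (String server : servers) {
            System.out.println(server + " reachable: " + canConnect(server, 2000));
        }
    }
}
```

If a port is reported as unreachable, the ZooKeeper server on that port is probably not running or not listening on that address. If it is reachable, the problem more likely lies at the ZooKeeper session level rather than at the network level.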



Looking forward to your reply.
Thanks.

Sincerely,
Yu



On Thu, Sep 20, 2012 at 5:25 PM, Matthieu Morel <mmorel@apache.org> wrote:
Hi,

As far as I can tell from the logs, the local host name of the node is not resolved correctly: it is resolved as "localhost" instead of the fully qualified host name.

S4 currently uses the following method to resolve the local host name:
InetAddress.getLocalHost().getCanonicalHostName()

Note that getLocalHost() will return the loopback address if you have a security manager that does not allow resolving the local host. Otherwise, there might be something wrong with your /etc/hosts file (or equivalent).
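
To see what this call returns on a given machine, a small standalone check along the following lines may help. It is only a sketch: the class name LocalHostNameCheck and the helper looksLikeLoopback are made up for illustration and are not part of S4. The only call taken from S4 is InetAddress.getLocalHost().getCanonicalHostName().

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of S4: prints what the local host resolves to.
final class LocalHostNameCheck {

    /** Returns true when the resolved name or address points at the loopback interface. */
    static boolean looksLikeLoopback(String canonicalName, boolean isLoopbackAddress) {
        return isLoopbackAddress
                || canonicalName.equals("localhost")
                || canonicalName.startsWith("localhost.");
    }

    public static void main(String[] args) throws UnknownHostException {
        // Same resolution path as the one S4 uses for the local host name.
        InetAddress local = InetAddress.getLocalHost();
        String canonicalName = local.getCanonicalHostName();

        System.out.println("canonical host name: " + canonicalName);
        System.out.println("address:             " + local.getHostAddress());

        if (looksLikeLoopback(canonicalName, local.isLoopbackAddress())) {
            System.out.println("Resolves to loopback; check /etc/hosts (or equivalent).");
        }
    }
}
```

If this prints "localhost" or a 127.x.x.x address, other machines would presumably be told to contact the node at "localhost", which matches the machineName=localhost entries in the topology logs.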

Hope this helps,

Matthieu



On 9/20/12 4:49 AM, Frank Zheng wrote:
Hi,

I ran the Twitter Trending Example on two machines to test the S4
fail-over mechanism.
First I set up five ZooKeeper servers, three on machine1 and two on
machine2.
Then I set up two PE nodes and one adapter node on machine2. Afterwards
I set up three standby nodes on machine1, two for PE and one for the
adapter.
When I shut down one PE node on machine2, ZooKeeper assigned its task
to one of the standby PE nodes on machine1. That node acquired the task
but did not work correctly.

Standby PE Node on machine1

10:35:57.742 [main] INFO  org.apache.s4.core.Main - Initializing S4 node with :
- comm module class [org.apache.s4.comm.DefaultCommModule]
- comm configuration file [default.s4.comm.properties from classpath]
- core module class [org.apache.s4.core.DefaultCoreModule]
- core configuration file[default.s4.core.properties from classpath]
- extra modules: []
- inline parameters: []
10:35:57.752 [main] DEBUG org.apache.s4.core.Main - Adding named parameters for injection : [s4.cluster.zk_address=testing.machine1:2182]
10:35:58.073 [main] INFO  org.apache.s4.core.Main - Starting S4 node. This node will automatically download applications published for the cluster it belongs to
10:35:58.175 [main] INFO  o.a.s.comm.topology.AssignmentFromZK - New session:88349612453724188; state is : SyncConnected
10:35:58.185 [main] INFO  o.a.s.comm.topology.AssignmentFromZK - Could not acquire task. Going into standby mode
10:35:58.254 [main] INFO  org.apache.s4.core.Server - Loading application [twitter-counter] from file [/tmp/tmp3353582636362855640s4r]
10:35:58.255 [main] WARN  o.a.s4.base.util.S4RLoaderFactory - s4.tmp.dir not specified, using temporary directory [/tmp/1348108558254-0] for unpacking S4R. You may want to specify a parent non-temporary directory.
10:35:58.255 [main] INFO  o.a.s4.base.util.S4RLoaderFactory - Unzipping S4R archive in [/tmp/1348108558254-0]
10:35:58.351 [main] INFO  org.apache.s4.core.Server - App class name is: org.apache.s4.example.twitter.TwitterCounterApp
10:35:58.423 [main] INFO  o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology to { nbNodes=2,name=cluster1,mode=unicast,type=,nodes=[{partition=0,port=12000,machineName=localhost,taskId=Task-0}, {partition=1,port=12001,machineName=localhost,taskId=Task-1}]} from null
10:35:58.458 [main] INFO  o.a.s4.comm.topology.ClusterFromZK - Adding topology change listener:org.apache.s4.comm.tcp.TCPEmitter@e4c6320


When one working PE node on machine2 failed, the standby PE node
logged the following:

10:39:24.047 [ZkClient-EventThread-19-testing.machine1:2182] INFO  o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology to { nbNodes=1,name=cluster1,mode=unicast,type=,nodes=[{partition=1,port=12001,machineName=localhost,taskId=Task-1}]} from { nbNodes=2,name=cluster1,mode=unicast,type=,nodes=[{partition=0,port=12000,machineName=localhost,taskId=Task-0}, {partition=1,port=12001,machineName=localhost,taskId=Task-1}]}
10:39:24.102 [ZkClient-EventThread-16-testing.machine1:2182] INFO  o.a.s.comm.topology.AssignmentFromZK - Successfully acquired task:Task-0 by localhost
10:39:24.116 [ZkClient-EventThread-19-testing.machine1:2182] INFO  o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology to { nbNodes=2,name=cluster1,mode=unicast,type=,nodes=[{partition=0,port=12000,machineName=localhost,taskId=Task-0}, {partition=1,port=12001,machineName=localhost,taskId=Task-1}]} from { nbNodes=1,name=cluster1,mode=unicast,type=,nodes=[{partition=1,port=12001,machineName=localhost,taskId=Task-1}]}
10:39:24.159 [main] INFO  o.a.s4.comm.topology.ClustersFromZK - New session:88349612453724194
10:39:24.162 [main] INFO  o.a.s4.comm.topology.ClustersFromZK - Detected new stream [RawStatus]
10:39:24.193 [main] INFO  o.a.s4.comm.topology.ClustersFromZK - New session:88349612453724195
10:39:24.205 [main] INFO  o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology to { nbNodes=2,name=cluster1,mode=unicast,type=,nodes=[{partition=0,port=12000,machineName=localhost,taskId=Task-0}, {partition=1,port=12001,machineName=localhost,taskId=Task-1}]} from null
10:39:24.212 [main] INFO  o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology to { nbNodes=1,name=cluster2,mode=unicast,type=,nodes=[{partition=0,port=13000,machineName=localhost,taskId=Task-0}]} from null
10:39:24.213 [main] INFO  org.apache.s4.core.Server - Loaded application from file /tmp/tmp2695149871633020370s4r
10:39:24.213 [main] INFO  o.a.s.d.DistributedDeploymentManager - Successfully installed application twitter-counter
10:39:24.231 [main] DEBUG o.a.s.c.g.OverloadDispatcherGenerator - Dumping generated overload dispatcher class for PE of class [class org.apache.s4.example.twitter.TopNTopicPE]
10:39:24.249 [main] INFO  o.a.s4.example.twitter.TopNTopicPE - key: []
10:39:24.254 [main] DEBUG o.a.s.c.g.OverloadDispatcherGenerator - Dumping generated overload dispatcher class for PE of class [class org.apache.s4.example.twitter.TopicCountAndReportPE]
10:39:24.256 [main] DEBUG o.a.s.c.g.OverloadDispatcherGenerator - Dumping generated overload dispatcher class for PE of class [class org.apache.s4.example.twitter.TopicExtractorPE]
10:39:24.256 [main] DEBUG o.a.s4.comm.topology.ClustersFromZK - Adding input stream [RawStatus] for app [-1] in cluster [cluster1]
10:39:24.332 [main] INFO  org.apache.s4.core.App - Init prototype [org.apache.s4.example.twitter.TopNTopicPE].
10:39:24.334 [main] DEBUG org.apache.s4.core.ProcessingElement - Started timer for PE prototype [org.apache.s4.example.twitter.TopNTopicPE], ID [] with interval [10000].
10:39:24.335 [main] DEBUG org.apache.s4.core.ProcessingElement - Started checkpointing timer for PE prototype [org.apache.s4.example.twitter.TopNTopicPE], ID [] with interval [20] [SECONDS].
10:39:24.335 [main] INFO  org.apache.s4.core.App - Init prototype [org.apache.s4.example.twitter.TopicCountAndReportPE].
10:39:24.336 [main] DEBUG org.apache.s4.core.ProcessingElement - Started timer for PE prototype [org.apache.s4.example.twitter.TopicCountAndReportPE], ID [] with interval [10000].
10:39:24.336 [main] INFO  org.apache.s4.core.App - Init prototype [org.apache.s4.example.twitter.TopicExtractorPE].


This node halted here and did not work until the adapter node on
machine2 failed and the standby adapter node on machine1 took over.
Then the halted PE nodes on machine1 worked correctly, but the working
PE nodes on machine2 stopped and logged the following:

10:43:44.064 [ZkClient-EventThread-27-testing.machine1:2182] INFO  o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology to { nbNodes=0,name=unknown,mode=unicast,type=,nodes=[]} from { nbNodes=1,name=cluster2,mode=unicast,type=,nodes=[{partition=0,port=13000,machineName=localhost,taskId=Task-0}]}
10:43:44.113 [ZkClient-EventThread-27-testing.machine1:2182] INFO  o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology to { nbNodes=1,name=cluster2,mode=unicast,type=,nodes=[{partition=0,port=13000,machineName=localhost,taskId=Task-0}]} from { nbNodes=0,name=unknown,mode=unicast,type=,nodes=[]}


Does this mean that the PE nodes and the adapter node should be located
on the same machine?
It seems that local PE nodes cannot communicate with the adapter node on
the remote machine.

Sincerely,
Yu Zheng







--
Sincerely,
Zheng Yu
Mobile:  (852) 60670059
Email:    bearzheng2011@gmail.com