incubator-s4-user mailing list archives

From Matthieu Morel <mmo...@apache.org>
Subject Re: Run Twitter Trending Example on Multi Machines
Date Fri, 21 Sep 2012 08:16:01 GMT
This looks like a connectivity problem, or a problem with your
configuration.

I would recommend that you:
- use the "s4 status" tool to check which apps are deployed and which
nodes are running.
- check connectivity to ZooKeeper from whichever host is running the
processes that time out. You can try something like "nc -z
testing.machine1 2182"
- check the ZooKeeper logs and look for errors
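
If nc is not available on a host, the connectivity check above can be approximated in Java. This is only a sketch: the class name ZkPortCheck and the method canConnect are made up for illustration and are not part of S4 or ZooKeeper, and it assumes Java 7 or later (try-with-resources). Like "nc -z", it only verifies that a TCP connection can be opened; it says nothing about whether ZooKeeper itself is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ZkPortCheck {

    /**
     * Returns true if a TCP connection to "host:port" can be opened
     * within timeoutMs milliseconds, false otherwise.
     * (Hypothetical helper, not an S4 or ZooKeeper API.)
     */
    public static boolean canConnect(String hostPort, int timeoutMs) {
        int colon = hostPort.lastIndexOf(':');
        String host = hostPort.substring(0, colon);
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        try (Socket socket = new Socket()) {
            // Fails with an IOException if the name cannot be resolved,
            // the connection is refused, or the timeout expires.
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default taken from the thread; pass another host:port as argument.
        String target = args.length > 0 ? args[0] : "testing.machine1:2182";
        // 10000 ms mirrors the timeout reported in the ZkTimeoutException.
        boolean ok = canConnect(target, 10000);
        System.out.println(target + (ok ? " is reachable" : " is NOT reachable"));
    }
}
```

Running it from each machine that hosts an S4 node, against each ZooKeeper address in use (for example testing.machine1:2182 and testing.machine1:2183), should show which host and port pair is the one timing out.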

Let us know!

Regards,

Matthieu


On 9/21/12 4:40 AM, Frank Zheng wrote:
> Hi Matthieu,
>
> I reconfigured the /etc/hosts file to map the local machine name to its
> real IP address instead of 127.0.0.1.
> Then it worked!
> Thank you so much.
>
> Now here comes another problem.
> I followed the steps of the Run Twitter Trending Example and set up two
> clusters with newCluster on the same server, testing.machine1:2182.
> Then I started two nodes of cluster1 and one node of cluster2 against the
> same server, testing.machine1:2182.
>
> When I deployed the twitter-counter app on cluster1, there was no problem.
> When I deployed the twitter-adapter app on cluster2, it did not work.
>
> [root@testing apache-s4-0.5.0-incubating-src]# ./s4 deploy
> -s4r=/usr/apache-s4-0.5.0-incubating-src/test-apps/twitter-adapter/build/libs/twitter-adapter.s4r
> -c=cluster2 -appName=twitter-adapter -zk=testing.machine1:2182
> 10:31:27.178 [main] ERROR org.apache.s4.tools.Deploy - Cannot deploy app
> org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to
> zookeeper server within timeout: 10000
>      at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:876)
> ~[zkclient-0.1.jar:na]
>      at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
> ~[zkclient-0.1.jar:na]
>      at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:92)
> ~[zkclient-0.1.jar:na]
>      at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:76)
> ~[zkclient-0.1.jar:na]
>      at org.apache.s4.tools.Deploy.main(Deploy.java:59)
> ~[s4-tools-0.5.0-incubating.jar:0.5.0-incubating]
>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ~[na:1.6.0_22]
>      at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> ~[na:1.6.0_22]
>      at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[na:1.6.0_22]
>      at java.lang.reflect.Method.invoke(Method.java:616) ~[na:1.6.0_22]
>      at org.apache.s4.tools.Tools$Task.dispatch(Tools.java:54)
> [s4-tools-0.5.0-incubating.jar:0.5.0-incubating]
>      at org.apache.s4.tools.Tools.main(Tools.java:94)
> [s4-tools-0.5.0-incubating.jar:0.5.0-incubating]
>
>
> Then I deployed the twitter-adapter app using another server on machine1,
> testing.machine1:2183. It worked.
>
> [root@testing apache-s4-0.5.0-incubating-src]# ./s4 deploy
> -s4r=/usr/apache-s4-0.5.0-incubating-src/test-apps/twitter-adapter/build/libs/twitter-adapter.s4r
> -c=cluster2 -appName=twitter-adapter -zk=testing.machine1:2183
> 10:33:00.830 [main] INFO  org.apache.s4.tools.Deploy - Using specified
> S4R
> [/usr/apache-s4-0.5.0-incubating-src/test-apps/twitter-adapter/build/libs/twitter-adapter.s4r],
> the S4R archive will not be built from source (and corresponding
> parameters are ignored)
> 10:33:00.911 [main] INFO  org.apache.s4.tools.Deploy - uploaded
> application [twitter-adapter] to cluster [cluster2], using zookeeper
> znode [/s4/clusters/cluster2/app/twitter-adapter], and s4r file
> [/usr/apache-s4-0.5.0-incubating-src/test-apps/twitter-adapter/build/libs/twitter-adapter.s4r]
>
>
> Then I checked the logs of the PE node. The error is as follows.
>
> [root@testing apache-s4-0.5.0-incubating-src]# ./s4 node -c=cluster1
> -zk=testing.machine1:2182
> 10:36:05.165 [main] INFO  org.apache.s4.core.Main - Initializing S4 node
> with :
> - comm module class [org.apache.s4.comm.DefaultCommModule]
> - comm configuration file [default.s4.comm.properties from classpath]
> - core module class [org.apache.s4.core.DefaultCoreModule]
> - core configuration file[default.s4.core.properties from classpath]
> - extra modules: []
> - inline parameters: []
> 10:36:05.175 [main] DEBUG org.apache.s4.core.Main - Adding named
> parameters for injection : [s4.cluster.zk_address=testing.machine1:2182]
> 10:36:05.525 [main] INFO  org.apache.s4.core.Main - Starting S4 node.
> This node will automatically download applications published for the
> cluster it belongs to
> 10:36:16.745 [main] ERROR org.apache.s4.core.Main - Cannot start S4 node
> com.google.inject.ProvisionException: Guice provision errors:
>
> 1) Error injecting constructor,
> org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to
> zookeeper server within timeout: 10000
>    at org.apache.s4.core.Server.<init>(Server.java:71)
>    while locating org.apache.s4.core.Server
>
> 1 error
>      at
> com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:987)
> ~[guice-3.0.jar:na]
>      at
> com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1013)
> ~[guice-3.0.jar:na]
>      at org.apache.s4.core.Main.startNode(Main.java:148)
> [s4-core-0.5.0-incubating.jar:0.5.0-incubating]
>      at org.apache.s4.core.Main.main(Main.java:75)
> [s4-core-0.5.0-incubating.jar:0.5.0-incubating]
>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ~[na:1.6.0_22]
>      at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> ~[na:1.6.0_22]
>      at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[na:1.6.0_22]
>      at java.lang.reflect.Method.invoke(Method.java:616) ~[na:1.6.0_22]
>      at org.apache.s4.tools.Tools$Task.dispatch(Tools.java:54)
> [s4-tools-0.5.0-incubating.jar:0.5.0-incubating]
>      at org.apache.s4.tools.Tools.main(Tools.java:94)
> [s4-tools-0.5.0-incubating.jar:0.5.0-incubating]
> Caused by: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to
> connect to zookeeper server within timeout: 10000
>      at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:876)
> ~[zkclient-0.1.jar:na]
>      at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
> ~[zkclient-0.1.jar:na]
>      at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:92)
> ~[zkclient-0.1.jar:na]
>      at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:80)
> ~[zkclient-0.1.jar:na]
>      at org.apache.s4.core.Server.<init>(Server.java:74)
> ~[s4-core-0.5.0-incubating.jar:0.5.0-incubating]
>      at
> org.apache.s4.core.Server$$FastClassByGuice$$69e0fd5b.newInstance(<generated>)
> ~[guice-3.0.jar:0.5.0-incubating]
>      at
> com.google.inject.internal.cglib.reflect.$FastConstructor.newInstance(FastConstructor.java:40)
> ~[guice-3.0.jar:na]
>      at
> com.google.inject.internal.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:60)
> ~[guice-3.0.jar:na]
>      at
> com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:85)
> ~[guice-3.0.jar:na]
>      at
> com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:254)
> ~[guice-3.0.jar:na]
>      at
> com.google.inject.internal.InjectorImpl$4$1.call(InjectorImpl.java:978)
> ~[guice-3.0.jar:na]
>      at
> com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1024)
> ~[guice-3.0.jar:na]
>      at
> com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:974)
> ~[guice-3.0.jar:na]
>      ... 9 common frames omitted
>
>
>
> Looking forward to your reply.
> Thanks.
>
> Sincerely,
> Yu
>
>
>
> On Thu, Sep 20, 2012 at 5:25 PM, Matthieu Morel <mmorel@apache.org
> <mailto:mmorel@apache.org>> wrote:
>
>     Hi,
>
>     as far as I can tell from the logs, the local host name of the node
>     is not resolved correctly: it is resolved as "localhost" instead of
>     the fully qualified host name.
>
>     S4 currently uses the following method to resolve the local host name:
>     InetAddress.getLocalHost().getCanonicalHostName()
>
>     Note that getLocalHost() will return the loopback address if you
>     have a security manager that doesn't allow resolving the local host.
>     Otherwise there might be something wrong with your /etc/hosts file
>     (or equivalent).
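
A quick way to see what that call returns on a given machine is the sketch below. The class name LocalHostNameCheck and the helper isLoopbackName are hypothetical and not part of S4; only the InetAddress calls come from the JDK. The name test is a simple heuristic, assuming that a canonical name of "localhost" (or one starting with "localhost.") means the machine name is mapped to a loopback entry in /etc/hosts.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LocalHostNameCheck {

    /**
     * Heuristic: true if the canonical name is "localhost" or starts
     * with "localhost." (for example "localhost.localdomain").
     * (Hypothetical helper, not an S4 API.)
     */
    public static boolean isLoopbackName(String canonicalName) {
        String name = canonicalName.toLowerCase();
        return name.equals("localhost") || name.startsWith("localhost.");
    }

    public static void main(String[] args) throws UnknownHostException {
        // Same JDK calls as the ones quoted in the message above.
        InetAddress local = InetAddress.getLocalHost();
        String name = local.getCanonicalHostName();

        System.out.println("address        = " + local.getHostAddress());
        System.out.println("canonical name = " + name);

        if (local.isLoopbackAddress() || isLoopbackName(name)) {
            System.out.println("Local host resolves to loopback: nodes on "
                    + "other machines will not be able to reach this one "
                    + "by that name. Check /etc/hosts (or equivalent).");
        }
    }
}
```

If the printed canonical name is "localhost", the topology published to ZooKeeper will presumably carry machineName=localhost, as in the logs further down this thread.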
>
>     Hope this helps,
>
>     Matthieu
>
>
>
>     On 9/20/12 4:49 AM, Frank Zheng wrote:
>
>         Hi,
>
>         I ran the Twitter Trending Example on two machines to test the S4
>         fail-over mechanism.
>         First I set up five ZooKeeper servers, three on machine1 and two on
>         machine2.
>         Then I set up two PE nodes and one adapter node on machine2.
>         Afterwards I set up three standby nodes on machine1, two for PE and
>         one for adapter.
>         When I shut down one PE node on machine2, ZooKeeper assigned its
>         task to one standby PE node on machine1. But that node acquired the
>         task and did not work correctly.
>
>         Standby PE Node on machine1
>
>         10:35:57.742 [main] INFO  org.apache.s4.core.Main - Initializing
>         S4 node
>         with :
>         - comm module class [org.apache.s4.comm.DefaultCommModule]
>         - comm configuration file [default.s4.comm.properties from
>         classpath]
>         - core module class [org.apache.s4.core.DefaultCoreModule]
>         - core configuration file[default.s4.core.properties from
>         classpath]
>         - extra modules: []
>         - inline parameters: []
>         10:35:57.752 [main] DEBUG org.apache.s4.core.Main - Adding named
>         parameters for injection :
>         [s4.cluster.zk_address=testing.machine1:2182]
>         10:35:58.073 [main] INFO  org.apache.s4.core.Main - Starting S4
>         node.
>         This node will automatically download applications published for the
>         cluster it belongs to
>         10:35:58.175 [main] INFO  o.a.s.comm.topology.AssignmentFromZK
>         - New
>         session:88349612453724188; state is : SyncConnected
>         10:35:58.185 [main] INFO  o.a.s.comm.topology.AssignmentFromZK
>         - Could
>         not acquire task. Going into standby mode
>         10:35:58.254 [main] INFO  org.apache.s4.core.Server - Loading
>         application [twitter-counter] from file
>         [/tmp/tmp3353582636362855640s4r]
>         10:35:58.255 [main] WARN  o.a.s4.base.util.S4RLoaderFactory -
>         s4.tmp.dir
>         not specified, using temporary directory [/tmp/1348108558254-0] for
>         unpacking S4R. You may want to specify a parent non-temporary
>         directory.
>         10:35:58.255 [main] INFO  o.a.s4.base.util.S4RLoaderFactory -
>         Unzipping
>         S4R archive in [/tmp/1348108558254-0]
>         10:35:58.351 [main] INFO  org.apache.s4.core.Server - App class
>         name is:
>         org.apache.s4.example.twitter.TwitterCounterApp
>         10:35:58.423 [main] INFO  o.a.s4.comm.topology.ClusterFromZK -
>         Changing
>         cluster topology to {
>         nbNodes=2,name=cluster1,mode=unicast,type=,nodes=[{partition=0,port=12000,machineName=localhost,taskId=Task-0},
>         {partition=1,port=12001,machineName=localhost,taskId=Task-1}]}
>         from null
>         10:35:58.458 [main] INFO  o.a.s4.comm.topology.ClusterFromZK -
>         Adding
>         topology change listener:org.apache.s4.comm.tcp.TCPEmitter@e4c6320
>
>
>         When one working PE node on machine2 failed, the standby PE node
>         logged the following:
>
>         10:39:24.047 [ZkClient-EventThread-19-testing.machine1:2182] INFO
>         o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology
>         to {
>         nbNodes=1,name=cluster1,mode=unicast,type=,nodes=[{partition=1,port=12001,machineName=localhost,taskId=Task-1}]}
>         from {
>         nbNodes=2,name=cluster1,mode=unicast,type=,nodes=[{partition=0,port=12000,machineName=localhost,taskId=Task-0},
>         {partition=1,port=12001,machineName=localhost,taskId=Task-1}]}
>         10:39:24.102 [ZkClient-EventThread-16-testing.machine1:2182] INFO
>         o.a.s.comm.topology.AssignmentFromZK - Successfully acquired
>         task:Task-0
>         by localhost
>         10:39:24.116 [ZkClient-EventThread-19-testing.machine1:2182] INFO
>         o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology
>         to {
>         nbNodes=2,name=cluster1,mode=unicast,type=,nodes=[{partition=0,port=12000,machineName=localhost,taskId=Task-0},
>         {partition=1,port=12001,machineName=localhost,taskId=Task-1}]}
>         from {
>         nbNodes=1,name=cluster1,mode=unicast,type=,nodes=[{partition=1,port=12001,machineName=localhost,taskId=Task-1}]}
>         10:39:24.159 [main] INFO  o.a.s4.comm.topology.ClustersFromZK
>         - New
>         session:88349612453724194
>         10:39:24.162 [main] INFO  o.a.s4.comm.topology.ClustersFromZK
>         - Detected
>         new stream [RawStatus]
>         10:39:24.193 [main] INFO  o.a.s4.comm.topology.ClustersFromZK
>         - New
>         session:88349612453724195
>         10:39:24.205 [main] INFO  o.a.s4.comm.topology.ClusterFromZK -
>         Changing
>         cluster topology to {
>         nbNodes=2,name=cluster1,mode=unicast,type=,nodes=[{partition=0,port=12000,machineName=localhost,taskId=Task-0},
>         {partition=1,port=12001,machineName=localhost,taskId=Task-1}]}
>         from null
>         10:39:24.212 [main] INFO  o.a.s4.comm.topology.ClusterFromZK -
>         Changing
>         cluster topology to {
>         nbNodes=1,name=cluster2,mode=unicast,type=,nodes=[{partition=0,port=13000,machineName=localhost,taskId=Task-0}]}
>         from null
>         10:39:24.213 [main] INFO  org.apache.s4.core.Server - Loaded
>         application
>         from file /tmp/tmp2695149871633020370s4r
>         10:39:24.213 [main] INFO  o.a.s.d.DistributedDeploymentManager -
>         Successfully installed application twitter-counter
>         10:39:24.231 [main] DEBUG o.a.s.c.g.OverloadDispatcherGenerator -
>         Dumping generated overload dispatcher class for PE of class [class
>         org.apache.s4.example.twitter.TopNTopicPE]
>         10:39:24.249 [main] INFO  o.a.s4.example.twitter.TopNTopicPE -
>         key: []
>         10:39:24.254 [main] DEBUG o.a.s.c.g.OverloadDispatcherGenerator -
>         Dumping generated overload dispatcher class for PE of class [class
>         org.apache.s4.example.twitter.TopicCountAndReportPE]
>         10:39:24.256 [main] DEBUG o.a.s.c.g.OverloadDispatcherGenerator -
>         Dumping generated overload dispatcher class for PE of class [class
>         org.apache.s4.example.twitter.TopicExtractorPE]
>         10:39:24.256 [main] DEBUG o.a.s4.comm.topology.ClustersFromZK
>         - Adding
>         input stream [RawStatus] for app [-1] in cluster [cluster1]
>         10:39:24.332 [main] INFO  org.apache.s4.core.App - Init prototype
>         [org.apache.s4.example.twitter.TopNTopicPE].
>         10:39:24.334 [main] DEBUG org.apache.s4.core.ProcessingElement
>         - Started
>         timer for PE prototype
>         [org.apache.s4.example.twitter.TopNTopicPE], ID
>         [] with interval [10000].
>         10:39:24.335 [main] DEBUG org.apache.s4.core.ProcessingElement
>         - Started
>         checkpointing timer for PE prototype
>         [org.apache.s4.example.twitter.TopNTopicPE], ID [] with
>         interval [20]
>         [SECONDS].
>         10:39:24.335 [main] INFO  org.apache.s4.core.App - Init prototype
>         [org.apache.s4.example.twitter.TopicCountAndReportPE].
>         10:39:24.336 [main] DEBUG org.apache.s4.core.ProcessingElement
>         - Started
>         timer for PE prototype
>         [org.apache.s4.example.twitter.TopicCountAndReportPE], ID []
>         with
>         interval [10000].
>         10:39:24.336 [main] INFO  org.apache.s4.core.App - Init prototype
>         [org.apache.s4.example.twitter.TopicExtractorPE].
>
>
>         This node halted here and did not work until the adapter node on
>         machine2 failed and the standby adapter node on machine1 took over.
>         Then the halted PE nodes on machine1 worked correctly, but the
>         working PE nodes on machine2 stopped and had logs as follows.
>
>         10:43:44.064 [ZkClient-EventThread-27-testing.machine1:2182] INFO
>         o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology
>         to {
>         nbNodes=0,name=unknown,mode=unicast,type=,nodes=[]} from {
>         nbNodes=1,name=cluster2,mode=unicast,type=,nodes=[{partition=0,port=13000,machineName=localhost,taskId=Task-0}]}
>         10:43:44.113 [ZkClient-EventThread-27-testing.machine1:2182] INFO
>         o.a.s4.comm.topology.ClusterFromZK - Changing cluster topology
>         to {
>         nbNodes=1,name=cluster2,mode=unicast,type=,nodes=[{partition=0,port=13000,machineName=localhost,taskId=Task-0}]}
>         from { nbNodes=0,name=unknown,mode=unicast,type=,nodes=[]}
>
>
>         Does this mean that the PE nodes and the adapter node should be
>         located on the same machine?
>         It seems that local PE nodes cannot communicate with the adapter
>         node on the remote machine.
>
>         Sincerely,
>         Yu Zheng
>
>
>
>
>
>
>
> --
> Sincerely,
> Zheng Yu
> Mobile:  (852) 60670059
> Email: bearzheng2011@gmail.com <mailto:bearzheng2011@gmail.com>
>
>
>

