accumulo-user mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: Communication issue between zookeeper and accumulo
Date Thu, 08 Aug 2013 15:16:34 GMT
I thought I had fixed a lock issue for 1.4.4, so I looked at the fixes for
1.4.4.  You may be running into ACCUMULO-1277 [1].  I just looked at the
1.4.3 code to see how it would behave, and I think it would time out like
you are seeing.  If we can confirm this, it would be worthwhile posting
your log messages about waiting and "could not obtain lock" on the ticket
so that it's easier to find the issue via Google.

[1] https://issues.apache.org/jira/browse/ACCUMULO-1277
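To illustrate the timeout behavior described above, here is a minimal, hypothetical Java sketch of a bounded retry loop. It is not Accumulo's TabletServer code: `LockRetrySketch`, `acquireWithRetries`, the attempt limit and the wait interval are all assumptions, picked only to mirror the quoted log, which shows 24 "Waiting for tablet server lock" messages about 5 seconds apart followed by "Too many retries, exiting." If the lock is never granted, a loop shaped like this always ends in that same exit.

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch of a bounded lock-acquisition retry loop.
 * NOT Accumulo's implementation; names and limits are assumptions.
 */
public class LockRetrySketch {

    /** Returns true if the lock was obtained, false once every attempt has failed. */
    public static boolean acquireWithRetries(BooleanSupplier tryLock, int maxAttempts,
                                             long waitMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryLock.getAsBoolean()) {
                return true; // lock granted, stop retrying
            }
            // Each failed attempt logs a message and waits before trying again.
            System.out.println("Waiting for tablet server lock");
            Thread.sleep(waitMillis);
        }
        // All attempts failed: report and give up, as the tablet server did.
        System.out.println("Too many retries, exiting.");
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a lock that is never granted, using a short wait so the demo runs quickly.
        boolean ok = acquireWithRetries(() -> false, 3, 10);
        System.out.println("lock obtained: " + ok);
    }
}
```

Running `main` prints "Waiting for tablet server lock" three times, then "Too many retries, exiting." and "lock obtained: false". With `maxAttempts = 24` and `waitMillis = 5000`, the sketch would give up about two minutes after the first attempt, roughly the span between 10:15:57 and 10:17:57 in the log.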


On Thu, Aug 8, 2013 at 10:03 AM, Ray Pfaff <ray.pfaff@apx-labs.com> wrote:

>  I'm trying to see if I can post the entire log somewhere.  In the
> interim, this is a copy of the error as it appears in the log file.
>
> 2013-08-01 10:15:55,980 [tabletserver.TabletServer] INFO : Tablet server starting on 10.1.3.227
> 2013-08-01 10:15:56,087 [util.FileSystemMonitor] INFO : Filesystem monitor started
> 2013-08-01 10:15:56,121 [tabletserver.NativeMap] INFO : Loaded native map shared library /opt/accumulo/accumulo-current/lib/native/map/libNativeMap-Linux-tile-64.so
> 2013-08-01 10:15:57,394 [tabletserver.TabletServer] INFO : port = 9997
> 2013-08-01 10:15:57,493 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:02,504 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:07,517 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:12,528 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:17,539 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:22,550 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:27,566 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:32,582 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:37,594 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:42,607 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:47,617 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:52,628 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:16:57,639 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:02,650 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:07,662 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:12,672 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:17,690 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:22,701 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:27,711 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:32,724 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:37,735 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:42,745 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:47,763 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:52,774 [tabletserver.TabletServer] INFO : Waiting for tablet server lock
> 2013-08-01 10:17:57,775 [tabletserver.TabletServer] INFO : Too many retries, exiting.
> 2013-08-01 10:17:57,778 [tabletserver.TabletServer] INFO : Could not obtain tablet server lock, exiting.
> java.lang.RuntimeException: Too many retries, exiting.
> at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2681)
> at org.apache.accumulo.server.tabletserver.TabletServer.run(TabletServer.java:2703)
> at org.apache.accumulo.server.tabletserver.TabletServer.main(TabletServer.java:3168)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at org.apache.accumulo.start.Main$1.run(Main.java:89)
> at java.lang.Thread.run(Thread.java:636)
> 2013-08-01 10:17:57,786 [tabletserver.TabletServer] ERROR: Uncaught exception in TabletServer.main, exiting
> java.lang.RuntimeException: java.lang.RuntimeException: Too many retries, exiting.
> at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2684)
> at org.apache.accumulo.server.tabletserver.TabletServer.run(TabletServer.java:2703)
> at org.apache.accumulo.server.tabletserver.TabletServer.main(TabletServer.java:3168)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:616)
> at org.apache.accumulo.start.Main$1.run(Main.java:89)
> at java.lang.Thread.run(Thread.java:636)
> Caused by: java.lang.RuntimeException: Too many retries, exiting.
> at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2681)
> ... 8 more
>
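Keith suggests confirming the issue before posting the log to the ticket. One hedged way to summarize a log like the one above is to pull out the timestamps of the "Waiting for tablet server lock" entries and compute the gap between them. `LockWaitTimeline` and its methods are illustrative names, not part of Accumulo; the sketch assumes the log file's original unwrapped lines, each starting with a 23-character timestamp such as `2013-08-01 10:15:57,493`.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: measures how often the lock-wait message was logged. */
public class LockWaitTimeline {

    // Timestamp layout seen in the quoted log, e.g. "2013-08-01 10:15:57,493".
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
    private static final String WAIT_MSG = "Waiting for tablet server lock";

    /** Returns the timestamp of every line that contains the wait message. */
    public static List<LocalDateTime> waitTimestamps(List<String> lines) {
        List<LocalDateTime> result = new ArrayList<>();
        for (String line : lines) {
            if (line.length() >= 23 && line.contains(WAIT_MSG)) {
                result.add(LocalDateTime.parse(line.substring(0, 23), TS));
            }
        }
        return result;
    }

    /** Mean gap between consecutive timestamps in milliseconds (0 if fewer than two). */
    public static long meanIntervalMillis(List<LocalDateTime> ts) {
        if (ts.size() < 2) {
            return 0;
        }
        long total = Duration.between(ts.get(0), ts.get(ts.size() - 1)).toMillis();
        return total / (ts.size() - 1);
    }

    public static void main(String[] args) {
        // First three wait entries from the quoted log, one line each.
        List<String> log = List.of(
            "2013-08-01 10:15:57,493 [tabletserver.TabletServer] INFO : Waiting for tablet server lock",
            "2013-08-01 10:16:02,504 [tabletserver.TabletServer] INFO : Waiting for tablet server lock",
            "2013-08-01 10:16:07,517 [tabletserver.TabletServer] INFO : Waiting for tablet server lock");
        List<LocalDateTime> ts = waitTimestamps(log);
        // Prints "3 waits, mean gap 5012 ms"
        System.out.println(ts.size() + " waits, mean gap " + meanIntervalMillis(ts) + " ms");
    }
}
```

Run over the full excerpt above, the same helper should report 24 waits, about 5 seconds apart.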
>   From: Sean Busbey <busbey@cloudera.com>
> Reply-To: "user@accumulo.apache.org" <user@accumulo.apache.org>
> Date: Wednesday, August 7, 2013 7:25 PM
>
> To: Accumulo User List <user@accumulo.apache.org>
> Subject: Re: Communication issue between zookeeper and accumulo
>
>   Can you post the full logs from the tablet servers somewhere and send a
> link?
>
>
>
> On Tue, Aug 6, 2013 at 10:40 AM, Ray Pfaff <ray.pfaff@apx-labs.com> wrote:
>
>>  It's from one of the tablet servers, but the output on one of the
>> ZooKeeper servers is exactly the same.
>>
>>   From: Sean Busbey <busbey@cloudera.com>
>> Reply-To: "user@accumulo.apache.org" <user@accumulo.apache.org>
>>  Date: Tuesday, August 6, 2013 1:35 PM
>>
>> To: Accumulo User List <user@accumulo.apache.org>
>> Subject: Re: Communication issue between zookeeper and accumulo
>>
>>   Is that on the ZK server or the TabletServer? Can we also see the
>> other?
>>
>>
>> On Tue, Aug 6, 2013 at 10:33 AM, Ray Pfaff <ray.pfaff@apx-labs.com> wrote:
>>
>>> Chain INPUT (policy ACCEPT)
>>> target     prot opt source               destination
>>> ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ssh
>>> ACCEPT     icmp --  anywhere             anywhere            icmp echo-reply
>>> ACCEPT     icmp --  anywhere             anywhere            icmp echo-request
>>> ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:nrpe
>>> ACCEPT     udp  --  anywhere             anywhere            udp dpt:domain
>>>
>>> Chain FORWARD (policy DROP)
>>> target     prot opt source               destination
>>>
>>> Chain OUTPUT (policy ACCEPT)
>>> target     prot opt source               destination
>>>
>>>   From: Brendan Heussler <bheussler@gmail.com>
>>> Reply-To: "user@accumulo.apache.org" <user@accumulo.apache.org>
>>> Date: Tuesday, August 6, 2013 1:27 PM
>>> To: "user@accumulo.apache.org" <user@accumulo.apache.org>
>>>
>>> Subject: Re: Communication issue between zookeeper and accumulo
>>>
>>>   What is the output of iptables --list?
>>>
>>>
>>>
>>> Brendan
>>>
>>>
>>> On Tue, Aug 6, 2013 at 1:25 PM, Ray Pfaff <ray.pfaff@apx-labs.com> wrote:
>>>
>>>>  Not sure what you mean.  I get the error "Fatal ip6_tables not
>>>> found."  I'm assuming that means disabled?
>>>>
>>>>   From: "Ott, Charles H." <CHARLES.H.OTT@saic.com>
>>>> Reply-To: "user@accumulo.apache.org" <user@accumulo.apache.org>
>>>> Date: Tuesday, August 6, 2013 1:18 PM
>>>> To: "user@accumulo.apache.org" <user@accumulo.apache.org>
>>>> Subject: RE: Communication issue between zookeeper and accumulo
>>>>
>>>>   And iptables?
>>>>
>>>> From: user-return-2837-CHARLES.H.OTT=saic.com@accumulo.apache.org [mailto:user-return-2837-CHARLES.H.OTT=saic.com@accumulo.apache.org]
>>>> On Behalf Of Ray Pfaff
>>>> Sent: Tuesday, August 06, 2013 12:54 PM
>>>> To: user@accumulo.apache.org
>>>> Subject: Re: Communication issue between zookeeper and accumulo
>>>>
>>>> Yes, it is disabled, so that's not the problem.
>>>>
>>>> From: Sean Busbey <busbey@cloudera.com>
>>>> Reply-To: "user@accumulo.apache.org" <user@accumulo.apache.org>
>>>> Date: Tuesday, August 6, 2013 12:48 PM
>>>> To: Accumulo User List <user@accumulo.apache.org>
>>>> Subject: Re: Communication issue between zookeeper and accumulo
>>>>
>>>> Hi Ray!
>>>>
>>>> Can you confirm that IPv6 is disabled?
>>>>
>>>> On Tue, Aug 6, 2013 at 9:19 AM, Ray Pfaff <ray.pfaff@apx-labs.com> wrote:
>>>>
>>>> I'm not sure I can provide those due to the contract I'm working under.  I
>>>> really don't want to divert this conversation from my original question
>>>> (which is a problem even when running one tablet server per machine), but
>>>> are you saying that setting tserver.port.search = true shouldn't be done?
>>>> I found this to be an undocumented way of running more than one tablet
>>>> server per system.  I'm still not convinced that it leads to stability
>>>> issues on the tablet servers.  As I said, it's undocumented.
>>>>
>>>> From: Eric Newton <eric.newton@gmail.com>
>>>> Reply-To: "user@accumulo.apache.org" <user@accumulo.apache.org>
>>>> Date: Tuesday, August 6, 2013 11:12 AM
>>>> To: "user@accumulo.apache.org" <user@accumulo.apache.org>
>>>> Subject: Re: Communication issue between zookeeper and accumulo
>>>>
>>>> Interesting.  You could not get similar performance improvements by
>>>> increasing the size of the JVM, the number of threads, or the number of
>>>> tablets per server?
>>>>
>>>> If you have details about what configurations you've tried and the
>>>> performance numbers you found, please open a ticket.  This would indicate
>>>> that we have some unnecessary bottleneck in the tserver.
>>>>
>>>> -Eric
>>>>
>>>> On Tue, Aug 6, 2013 at 11:00 AM, Ray Pfaff <ray.pfaff@apx-labs.com> wrote:
>>>>
>>>> Because we found this to be the optimal number of tablet servers in our
>>>> testing.  It performs better than one per machine.  I'm not convinced that
>>>> the stability issues make it worthwhile.
>>>>
>>>> Doesn't affect my problem anyway.  I get this error whether I run one
>>>> or four tablet servers.  Running four just makes it a bigger issue to get
>>>> back up after failure.
>>>>
>>>> From: Eric Newton <eric.newton@gmail.com>
>>>> Reply-To: "user@accumulo.apache.org" <user@accumulo.apache.org>
>>>> Date: Tuesday, August 6, 2013 10:56 AM
>>>> To: "user@accumulo.apache.org" <user@accumulo.apache.org>
>>>> Subject: Re: Communication issue between zookeeper and accumulo
>>>>
>>>>  I'm running 4 tservers per machine dedicated to the tablet servers
>>>>
>>>> Why?
>>>>
>
>
>  --
> Sean
>
