accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Remotely Accumulo
Date Wed, 08 Oct 2014 14:09:03 GMT
jstack is a tool that tells a Java process to dump the current stack traces
for all of its threads. It's usually included with the JDK. `kill -3 $pid`
does the same thing. If the output isn't printed to your shell, check the
stdout of the process you gave as an argument.
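
As a rough sketch, capturing the dump from a shell could look like the
following. The function name dump_threads is made up for illustration. The
fallback message is only a reminder that `kill -3` writes the dump to the
target JVM's own stdout; for a tserver that is probably a .out file in your
Accumulo log directory, but that depends on how the process was started.

```shell
# Sketch: print a thread dump for the JVM with the given pid.
# Prefers jstack; falls back to SIGQUIT (kill -3), whose output goes to the
# target JVM's stdout rather than to this shell.
dump_threads() {
  pid="$1"
  if command -v jstack >/dev/null 2>&1; then
    jstack "$pid"
  else
    kill -3 "$pid" &&
      echo "SIGQUIT sent to $pid; look for the dump in that process's stdout" >&2
  fi
}
```

Something like `dump_threads "$tserver_pid" > tserver.dump` would save the
dump to a file. Run it as the same user that owns the tserver process.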

When your client is sitting waiting on data from the tabletserver, you can
get the stack traces from the tserver and you should be able to find a
thread with scan in the name, along with your client's IP, and we can help
debug exactly what the server is doing that is preventing it from returning
data to your client.
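
To narrow a dump down to the interesting threads, a filter along these lines
may help. It is only a sketch: threads_matching is a hypothetical helper, it
assumes the usual jstack layout (one blank-line-separated block per thread,
with the quoted thread name on the first line), and the exact thread names
your tserver uses may differ.

```shell
# Sketch: read a thread dump on stdin and print only the stack blocks whose
# header line (the first line of each blank-line-separated block) contains
# the given text, ignoring case.
threads_matching() {
  awk -v pat="$1" '
    BEGIN { RS = ""; ORS = "\n\n" }   # paragraph mode: one record per block
    {
      split($0, lines, "\n")          # lines[1] is the thread header line
      if (index(tolower(lines[1]), tolower(pat)) > 0) print
    }'
}
```

For example, `jstack "$tserver_pid" | threads_matching scan` keeps only the
threads with "scan" in the header; passing your client's IP instead of
"scan" works the same way.
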
On Oct 8, 2014 9:43 AM, "Geoffry Roberts" <threadedblue@gmail.com> wrote:

> Thanks Josh.  But what do you mean by "jstack'ing"?  I'm unfamiliar with
> that term.  A better question would be how can one troubleshoot such a
> thing?
>
> btw
> I am the sole user on this cluster.
>
> On Tue, Oct 7, 2014 at 4:18 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>> Ok, this record:
>>
>> tcp        0      0 0.0.0.0:9997                0.0.0.0:*                   LISTEN
>>
>> means that your tserver is listening on the correct port on all interfaces.
>> There shouldn't be issues connecting to the tserver. This is also
>> confirmed by the fact that you authenticated and got a Connector (this
>> does an RPC to the tserver).
>>
>> So, your tserver is up, and your client can communicate with it. The
>> real question is why the scan is hanging. Perhaps try jstack'ing the
>> tserver while your client is blocked waiting for results.
>>
>> On Tue, Oct 7, 2014 at 2:07 PM, Geoffry Roberts <threadedblue@gmail.com>
>> wrote:
>> > "...it's when
>> > you make a Connector, and your client will talk to a tabletserver to
>> > authenticate, that your program should hang. It would be good to
>> > verify that."
>> >
>> >
>> > My program should hang?  Would you expand?  That is exactly what it is
>> > doing.  I am able to get a connector.  But when I try to iterate the
>> > result of a scan, that's when it hangs.
>> >
>> >
>> >
>> >
>> > Here's what comes from netstat:
>> >
>> >
>> > $ netstat -na | grep 9997
>> >
>> > tcp        0      0 0.0.0.0:9997                0.0.0.0:*                   LISTEN
>> > tcp        0      0 204.9.140.36:35679          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:53146          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33896          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:53282          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:53188          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:35609          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33901          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:35588          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33877          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33946          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:53167          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33949          204.9.140.38:9997           ESTABLISHED
>> > tcp        0      0 204.9.140.36:35546          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33852          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:53125          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33922          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33747          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33961          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33793          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:35768          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33917          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33814          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:35567          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33444          204.9.140.38:9997           FIN_WAIT2
>> > tcp        0      0 204.9.140.36:35701          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33969          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:53258          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33831          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:53210          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:53104          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33789          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33856          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:53237          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33835          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:35651          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33938          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33041          204.9.140.36:9997           ESTABLISHED
>> > tcp        0      0 204.9.140.36:53285          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:53305          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33768          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:35630          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33754          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:35745          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:35724          204.9.140.36:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:9997           204.9.140.36:33041          ESTABLISHED
>> > tcp        0      0 204.9.140.36:53083          204.9.140.37:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:50623          204.9.140.37:9997           ESTABLISHED
>> > tcp        0      0 204.9.140.36:33772          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33732          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33874          204.9.140.38:9997           TIME_WAIT
>> > tcp        0      0 204.9.140.36:33810          204.9.140.38:9997           TIME_WAIT
>> >
>> >
>> >
>> > On Tue, Oct 7, 2014 at 11:34 AM, Josh Elser <josh.elser@gmail.com> wrote:
>> >>
>> >> Can you provide the output from netstat, lsof or /proc/$pid/fd for the
>> >> tserver? Assuming you haven't altered tserv.port.client in
>> >> accumulo-site.xml, we want the line for port 9997.
>> >>
>> >> From my laptop running a tserver on localhost:
>> >>
>> >> $ netstat -na | grep 9997
>> >> tcp4       0      0  127.0.0.1.9997         *.*                    LISTEN
>> >>
>> >> Depending on the tool you use, you can grep out the pid of the tserver
>> >> or just that port itself.
>> >>
>> >> Just so you know, ZK binds to all available interfaces when it starts,
>> >> so it should work seamlessly with localhost or the FQDN for the host.
>> >> As such, it shouldn't matter what you provide to the
>> >> ZooKeeperInstance. That should connect in all cases for you, it's when
>> >> you make a Connector, and your client will talk to a tabletserver to
>> >> authenticate, that your program should hang. It would be good to
>> >> verify that.
>> >>
>> >> On Tue, Oct 7, 2014 at 11:23 AM, Geoffry Roberts <threadedblue@gmail.com> wrote:
>> >> > All,
>> >> >
>> >> > Thanks for the responses.
>> >> >
>> >> > Is this a problem for Accumulo?
>> >> > Reverse DNS is yielding my ISP's host name. You know the drill: my IP
>> >> > in reverse followed by their domain name, as opposed to my FQDN, which
>> >> > is what I use in my config files.
>> >> >
>> >> > Running Accumulo 1.5.1
>> >> > I have only one interface.
>> >> > I have the FQDN in both master and slaves files for both Hadoop and
>> >> > Accumulo; in zoo.cfg; and in accumulo-site.xml where the Zookeepers
>> >> > are referenced.
>> >> > Also, I am passing in all ZK FQDNs when I instantiate ZooKeeperInstance.
>> >> > Forward DNS works.
>> >> > Reverse DNS... well (see above).
>> >> >
>> >> >
>> >> >
>> >> > On Mon, Oct 6, 2014 at 10:26 PM, Adam Fuchs <afuchs@apache.org> wrote:
>> >> >>
>> >> >> Accumulo tservers typically listen on a single interface. If you have
>> >> >> a server with multiple interfaces (e.g. loopback and eth0), you might
>> >> >> have a problem in which the tablet servers are not listening on
>> >> >> externally reachable interfaces. Tablet servers will list the
>> >> >> interfaces that they are listening to when they boot, and you can
>> >> >> also use tools like lsof to find them.
>> >> >>
>> >> >> If that is indeed the problem, then you might just need to change
>> >> >> your conf/slaves file to use <hostname> instead of localhost, and
>> >> >> then restart.
>> >> >>
>> >> >> Adam
>> >> >>
>> >> >> On Oct 6, 2014 4:27 PM, "Geoffry Roberts" <threadedblue@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>>
>> >> >>> I have been happily working with Acc, but today things changed.  No
>> >> >>> errors.
>> >> >>>
>> >> >>> Until now I ran everything server side, which meant the URL was
>> >> >>> localhost:2181, and life was good.  Today I tried running some of the
>> >> >>> same code as a remote client, which means <host name>:2181.  Things
>> >> >>> hang when BatchWriter tries to commit anything, and Scan hangs when
>> >> >>> it tries to iterate through a Map.
>> >> >>>
>> >> >>> Let's focus on the scan part:
>> >> >>>
>> >> >>> scan.fetchColumnFamily(new Text("colfY")); // This executes, then hangs.
>> >> >>> for (Entry<Key,Value> entry : scan) {
>> >> >>>     def row = entry.getKey().getRow();
>> >> >>>     def value = entry.getValue();
>> >> >>>     println "value=" + value;
>> >> >>> }
>> >> >>>
>> >> >>> This is what appears in the console :
>> >> >>>
>> >> >>> 17:22:39.802 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x148c6f03388005e after 21ms
>> >> >>>
>> >> >>> 17:22:49.803 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x148c6f03388005e after 21ms
>> >> >>>
>> >> >>> <and on and on>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> The only difference between success and a hang is a URL change, and
>> >> >>> of course being remote.
>> >> >>>
>> >> >>> I don't believe this is a firewall issue.  I shut down the firewall.
>> >> >>>
>> >> >>> Am I missing something?
>> >> >>>
>> >> >>> Thanks all.
>> >> >>>
>> >> >>> --
>> >> >>> There are ways and there are ways,
>> >> >>>
>> >> >>> Geoffry Roberts
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > There are ways and there are ways,
>> >> >
>> >> > Geoffry Roberts
>> >
>> >
>> >
>> >
>> > --
>> > There are ways and there are ways,
>> >
>> > Geoffry Roberts
>>
>
>
>
> --
> There are ways and there are ways,
>
> Geoffry Roberts
>
