accumulo-user mailing list archives

From Geoffry Roberts <threadedb...@gmail.com>
Subject Re: Remotely Accumulo
Date Wed, 08 Oct 2014 22:39:27 GMT
Just for the record, I finally got to the bottom of things.  One of my
Tservers was running out of memory.  I hadn't noticed.  I had my SA
allocate a little more--each node now has 6G, up from 2G--and things are
working better.
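
For anyone who lands on this thread with the same symptom: the jstack
suggestion quoted below produces a long dump on a busy tserver, so a small
filter can help. What follows is only a sketch in plain Java, not anything
shipped with Accumulo. The class name ScanThreadFilter, the method
findThreads, and the dump-file argument are made up for illustration. It
assumes the usual jstack layout, where each thread's section starts with its
quoted name and sections are separated by blank lines.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: pulls the sections of a saved thread dump whose
// thread name contains a keyword such as "scan".
class ScanThreadFilter {

    // Returns every thread section whose header line contains the keyword
    // (case-insensitive). Only the header is checked, not the stack frames.
    static List<String> findThreads(String dump, String keyword) {
        List<String> matches = new ArrayList<>();
        String needle = keyword.toLowerCase(Locale.ROOT);
        // Assumption: sections are separated by one or more blank lines.
        for (String section : dump.split("\\R\\s*\\R")) {
            String trimmed = section.trim();
            // Thread sections begin with the quoted thread name.
            if (!trimmed.startsWith("\"")) {
                continue;
            }
            String header = trimmed.split("\\R", 2)[0];
            if (header.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(trimmed);
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ScanThreadFilter <dump-file> <keyword>");
            return;
        }
        String dump = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (String section : findThreads(dump, args[1])) {
            System.out.println(section);
            System.out.println();
        }
    }
}
```

To try it, save a dump with `jstack <tserver pid> > tserver.dump` while the
client is blocked, then run the class with the dump path and a keyword such
as scan or the client's IP.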
 On Oct 8, 2014 10:09 AM, "Josh Elser" <josh.elser@gmail.com> wrote:

> Jstack is a tool which can be used to tell a Java process to dump the
> current stack traces for all of its threads. It's usually included with the
> JDK. `kill -3 $pid` does the same. If the output isn't printed to your
> shell, check the stdout of the process whose pid you gave as an
> argument.
>
> When your client is sitting waiting on data from the tabletserver, you can
> get the stack traces from the tserver and you should be able to find a
> thread with scan in the name, along with your client's IP, and we can help
> debug exactly what the server is doing that is preventing it from returning
> data to your client.
> On Oct 8, 2014 9:43 AM, "Geoffry Roberts" <threadedblue@gmail.com> wrote:
>
>> Thanks Josh.  But what do you mean by "jstack'ing"?  I'm unfamiliar with
>> that term.  A better question would be how can one troubleshoot such a
>> thing?
>>
>> btw
>> I am the sole user on this cluster.
>>
>> On Tue, Oct 7, 2014 at 4:18 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>
>>> Ok, this record:
>>>
>>> tcp        0      0 0.0.0.0:9997                0.0.0.0:*                   LISTEN
>>>
>>> Means that your tserver is listening on the correct port on all interfaces.
>>> There shouldn't be issues connecting to the tserver. This is also
>>> confirmed by the fact that you authenticated and got a Connector (this
>>> does an RPC to the tserver).
>>>
>>> So, your tserver is up, and your client can communicate with it. The
>>> real question is why the scan is hanging. Perhaps try jstack'ing the
>>> tserver while your client is blocked waiting for results.
>>>
>>> On Tue, Oct 7, 2014 at 2:07 PM, Geoffry Roberts <threadedblue@gmail.com>
>>> wrote:
>>> > "...it's when
>>> > you make a Connector, and your client will talk to a tabletserver to
>>> > authenticate, that your program should hang. It would be good to
>>> > verify that."
>>> >
>>> >
>>> > My program should hang?  Would you expand?  That is exactly what it is
>>> > doing.  I am able to get a connector.  But when I try to iterate the
>>> > result of a scan, that's when it hangs.
>>> >
>>> >
>>> >
>>> >
>>> > Here's what comes from netstat:
>>> >
>>> >
>>> > $ netstat -na | grep 9997
>>> >
>>> > tcp        0      0 0.0.0.0:9997                0.0.0.0:*                   LISTEN
>>> > tcp        0      0 204.9.140.36:35679          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:53146          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33896          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:53282          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:53188          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:35609          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33901          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:35588          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33877          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33946          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:53167          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33949          204.9.140.38:9997           ESTABLISHED
>>> > tcp        0      0 204.9.140.36:35546          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33852          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:53125          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33922          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33747          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33961          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33793          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:35768          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33917          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33814          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:35567          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33444          204.9.140.38:9997           FIN_WAIT2
>>> > tcp        0      0 204.9.140.36:35701          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33969          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:53258          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33831          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:53210          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:53104          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33789          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33856          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:53237          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33835          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:35651          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33938          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33041          204.9.140.36:9997           ESTABLISHED
>>> > tcp        0      0 204.9.140.36:53285          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:53305          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33768          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:35630          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33754          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:35745          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:35724          204.9.140.36:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:9997           204.9.140.36:33041          ESTABLISHED
>>> > tcp        0      0 204.9.140.36:53083          204.9.140.37:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:50623          204.9.140.37:9997           ESTABLISHED
>>> > tcp        0      0 204.9.140.36:33772          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33732          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33874          204.9.140.38:9997           TIME_WAIT
>>> > tcp        0      0 204.9.140.36:33810          204.9.140.38:9997           TIME_WAIT
>>> >
>>> >
>>> > On Tue, Oct 7, 2014 at 11:34 AM, Josh Elser <josh.elser@gmail.com> wrote:
>>> >>
>>> >> Can you provide the output from netstat, lsof or /proc/$pid/fd for the
>>> >> tserver? Assuming you haven't altered tserv.port.client in
>>> >> accumulo-site.xml, we want the line for port 9997.
>>> >>
>>> >> From my laptop running a tserver on localhost:
>>> >>
>>> >> $ netstat -na | grep 9997
>>> >> tcp4       0      0  127.0.0.1.9997         *.*                    LISTEN
>>> >>
>>> >> Depending on the tool you use, you can grep out the pid of the tserver
>>> >> or just that port itself.
>>> >>
>>> >> Just so you know, ZK binds to all available interfaces when it starts,
>>> >> so it should work seamlessly with localhost or the FQDN for the host.
>>> >> As such, it shouldn't matter what you provide to the
>>> >> ZooKeeperInstance. That should connect in all cases for you. It's when
>>> >> you make a Connector, and your client will talk to a tabletserver to
>>> >> authenticate, that your program should hang. It would be good to
>>> >> verify that.
>>> >>
>>> >> On Tue, Oct 7, 2014 at 11:23 AM, Geoffry Roberts <threadedblue@gmail.com> wrote:
>>> >> > All,
>>> >> >
>>> >> > Thanks for the responses.
>>> >> >
>>> >> > Is this a problem for Accumulo?
>>> >> > Reverse DNS is yielding my ISP's host name. You know the drill: my IP in
>>> >> > reverse followed by their domain name, as opposed to my FQDN, which is
>>> >> > what I use in my config files.
>>> >> >
>>> >> > Running Accumulo 1.5.1
>>> >> > I have only one interface.
>>> >> > I have the FQDN in both master and slaves files for both Hadoop and
>>> >> > Accumulo; in zoo.cfg; and in accumulo-site.xml where the ZooKeepers are
>>> >> > referenced.
>>> >> > Also, I am passing in all ZK FQDNs when I instantiate ZooKeeperInstance.
>>> >> > Forward DNS works.
>>> >> > Reverse DNS... well (see above).
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Mon, Oct 6, 2014 at 10:26 PM, Adam Fuchs <afuchs@apache.org> wrote:
>>> >> >>
>>> >> >> Accumulo tservers typically listen on a single interface. If you have a
>>> >> >> server with multiple interfaces (e.g. loopback and eth0), you might have
>>> >> >> a problem in which the tablet servers are not listening on externally
>>> >> >> reachable interfaces. Tablet servers will list the interfaces that they
>>> >> >> are listening to when they boot, and you can also use tools like lsof to
>>> >> >> find them.
>>> >> >>
>>> >> >> If that is indeed the problem, then you might just need to change your
>>> >> >> conf/slaves file to use <hostname> instead of localhost, and then
>>> >> >> restart.
>>> >> >>
>>> >> >> Adam
>>> >> >>
>>> >> >> On Oct 6, 2014 4:27 PM, "Geoffry Roberts" <threadedblue@gmail.com> wrote:
>>> >> >>>
>>> >> >>>
>>> >> >>> I have been happily working with Accumulo, but today things changed. No
>>> >> >>> errors.
>>> >> >>>
>>> >> >>> Until now I ran everything server side, which meant the URL was
>>> >> >>> localhost:2181, and life was good.  Today I tried running some of the
>>> >> >>> same code as a remote client, which means <host name>:2181.  Things hang
>>> >> >>> when BatchWriter tries to commit anything, and Scan hangs when it tries
>>> >> >>> to iterate through a Map.
>>> >> >>>
>>> >> >>> Let's focus on the scan part:
>>> >> >>>
>>> >> >>> scan.fetchColumnFamily(new Text("colfY")); // This executes then hangs.
>>> >> >>> for (Entry<Key,Value> entry : scan) {
>>> >> >>>     def row = entry.getKey().getRow();
>>> >> >>>     def value = entry.getValue();
>>> >> >>>     println "value=" + value;
>>> >> >>> }
>>> >> >>>
>>> >> >>> This is what appears in the console :
>>> >> >>>
>>> >> >>> 17:22:39.802 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x148c6f03388005e after 21ms
>>> >> >>>
>>> >> >>> 17:22:49.803 C{0} M DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x148c6f03388005e after 21ms
>>> >> >>>
>>> >> >>> <and on and on>
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> The only difference between success and a hang is a URL change, and of
>>> >> >>> course being remote.
>>> >> >>>
>>> >> >>> I don't believe this is a firewall issue.  I shut down the firewall.
>>> >> >>>
>>> >> >>> Am I missing something?
>>> >> >>>
>>> >> >>> Thanks all.
>>> >> >>>
>>> >> >>> --
>>> >> >>> There are ways and there are ways,
>>> >> >>>
>>> >> >>> Geoffry Roberts
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > There are ways and there are ways,
>>> >> >
>>> >> > Geoffry Roberts
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > There are ways and there are ways,
>>> >
>>> > Geoffry Roberts
>>>
>>
>>
>>
>> --
>> There are ways and there are ways,
>>
>> Geoffry Roberts
>>
>
