hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alvaro Cabrerizo" <topor...@gmail.com>
Subject Re: is it closed nutch-14 bug? (minimized efects)
Date Wed, 14 Feb 2007 17:02:39 GMT
Hi:

Although I havent found the solution to the problem, at least I could
minimize the effects. First of all I'll try to explain what I think is
happening here.

Lets have a look a thread dump when system goes down:

Full thread dump Java HotSpot(TM) Server VM (1.5.0_07-b03 mixed mode):

"Server handler 4 on 55555" daemon prio=1 tid=0x08464710 nid=0x1317 in
Object.wait() [0x7a88e000..0x7a88f040]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x80cefc90> (a java.util.LinkedList)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471)
    - locked <0x80cefc90> (a java.util.LinkedList)

"Server handler 3 on 55555" daemon prio=1 tid=0x085cd550 nid=0x1316 in
Object.wait() [0x7a90f000..0x7a90f1c0]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x80cefc90> (a java.util.LinkedList)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471)
    - locked <0x80cefc90> (a java.util.LinkedList)

"Server handler 2 on 55555" daemon prio=1 tid=0x0867ea70 nid=0x1315 in
Object.wait() [0x7a98e000..0x7a98f140]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x80cefc90> (a java.util.LinkedList)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471)
    - locked <0x80cefc90> (a java.util.LinkedList)

"Server handler 1 on 55555" daemon prio=1 tid=0x0867e440 nid=0x1314 in
Object.wait() [0x7aa0e000..0x7aa0eec0]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x80cefc90> (a java.util.LinkedList)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471)
    - locked <0x80cefc90> (a java.util.LinkedList)

"Server handler 0 on 55555" daemon prio=1 tid=0x0867ec38 nid=0x1313 in
Object.wait() [0x7aa8e000..0x7aa8ee40]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x80cefc90> (a java.util.LinkedList)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471)
    - locked <0x80cefc90> (a java.util.LinkedList)

"Server listener on port 55555" daemon prio=1 tid=0x086dafd8 nid=0x1312 in
Object.wait() [0x7ab0e000..0x7ab0efc0]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x80d15e40> (a java.lang.Object)
    at org.apache.hadoop.ipc.Server$Connection.processData(Server.java:433)
    - locked <0x80d15e40> (a java.lang.Object)
    at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java
:402)
    at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:300)
    at org.apache.hadoop.ipc.Server$Listener.run(Server.java:213)

"Low Memory Detector" daemon prio=1 tid=0x080f6c88 nid=0x1310 runnable
[0x00000000..0x00000000]

"CompilerThread1" daemon prio=1 tid=0x080f5800 nid=0x130f waiting on
condition [0x00000000..0x7ccd5268]

"CompilerThread0" daemon prio=1 tid=0x080f47a0 nid=0x130e waiting on
condition [0x00000000..0x7cd553e8]

"AdapterThread" daemon prio=1 tid=0x080f3660 nid=0x130d waiting on condition
[0x00000000..0x00000000]

"Signal Dispatcher" daemon prio=1 tid=0x080f2780 nid=0x130c waiting on
condition [0x00000000..0x00000000]

"Finalizer" daemon prio=1 tid=0x080e7fb0 nid=0x130b in Object.wait()
[0x7d058000..0x7d058e40]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x80c603c8> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
    - locked <0x80c603c8> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=1 tid=0x080e7af8 nid=0x130a in Object.wait()
[0x7d0d8000..0x7d0d8fc0]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x80c600b0> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:474)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
    - locked <0x80c600b0> (a java.lang.ref.Reference$Lock)

"main" prio=1 tid=0x0805e250 nid=0x1305 in Object.wait()
[0xaffab000..0xaffac158]
    at java.lang.Object.wait(Native Method)
    - waiting on <0x80cefbe0> (a org.apache.hadoop.ipc.RPC$Server)
    at java.lang.Object.wait(Object.java:474)
    at org.apache.hadoop.ipc.Server.join(Server.java:587)
    - locked <0x80cefbe0> (a org.apache.hadoop.ipc.RPC$Server)
    at org.apache.nutch.searcher.DistributedSearch$Server.main(
DistributedSearch.java:75)

"VM Thread" prio=1 tid=0x080e5668 nid=0x1309 runnable

"VM Periodic Task Thread" prio=1 tid=0x080f8128 nid=0x1311 waiting on
condition


We can see that all of the handlers (I launch the server using 5 handlers)
are waiting for new queued calls, we can watch it in the next  piece of code
from the run method that belongs to class Handler:

 while (running && callQueue.size()==0) { // wait for a call
              callQueue.wait(timeout);    //line 471
            }

In the other hand the Listener is waiting because it thinks queue is full.
This is shown in the next piece of code from the processData method that
belongs to class Connection:

while (running && callQueue.size() >= maxQueuedCalls) {
          synchronized (callDequeued) {         // queue is full1
          callDequeued.wait(timeout);         // wait for a dequeue for 60
seconds (line 433)
        }
}

So Listener thinks queue is full and Handlers that it is empty, somebody
must be wrong. As we can see in the next snippet, from the method run of the
class Handler, every Handler notify a dequeued call, and this should be
enough to wake up the Listener:

while (running) {
        try {
          Call call;
          synchronized (callQueue) {
            while (running && callQueue.size()==0) { // wait for a call
                 callQueue.wait(timeout);
            }
            if (!running) break;
             call = (Call)callQueue.removeFirst(); // pop the queue
          }

          synchronized (callDequeued) {           // tell others we've
dequeued
            callDequeued.notify(); // <-this should wake up Listener but
sometimes dont
          }

So, in order to minimize the problem of having the Listener stopped for 60
seconds I've reduced its waiting time to 500ms.

I've also added some traces generated during a load test.

2007-02-13 21:11:46,910 INFO  ipc.Server - ----queue is full. The server
should wait. 5 calls left                      (line 1)
2007-02-13 21:11:46,911 INFO  ipc.RPC - Return:  ... fecha de 30 de junio de
2005 por la que se...
2007-02-13 21:11:46,911 INFO  ipc.Server - dequeued a call, 4 calls
left
2007-02-13 21:11:46,914 INFO  ipc.RPC - Return:  ... E, e incluida en la
Oferta de Empleo Públi...
2007-02-13 21:11:46,915 INFO  ipc.Server - dequeued a call, 3 calls
left                                                           (line 5)
2007-02-13 21:11:46,922 INFO  ipc.RPC - Return:  ... PROMOCION INTERNA UN
OFICIAL CELADOR OFERT...
2007-02-13 21:11:46,923 INFO  ipc.Server - dequeued a call, 2 calls left
2007-02-13 21:11:46,923 INFO  ipc.RPC - Return:
org.apache.nutch.searcher.Hits@1782e36
2007-02-13 21:11:46,923 INFO  ipc.Server - dequeued a call, 1 calls left
2007-02-13 21:11:46,923 INFO  ipc.Server - notify dequeue WAKE UP
LISTENER!!                                              (line 10)
2007-02-13 21:11:46,923 INFO  ipc.RPC - Call: getDetails(#14599)
2007-02-13 21:11:46,923 INFO  ipc.RPC - Return:
20060925130810/http://andaluciajunta.es/portal/...
2007-02-13 21:11:46,924 INFO  ipc.Server - dequeued a call, 0 calls left
2007-02-13 21:11:46,924 INFO  ipc.Server - notify dequeue WAKE UP LISTENER!!
2007-02-13 21:11:46,924 INFO  ipc.RPC - Call:
getSummary(20060925130810/http://andaluciajunta.e...          (line 15)
2007-02-13 21:11:46,928 INFO  ipc.RPC - Return:  ... parte de ellos. 2. La
Junta se entenderá v...
2007-02-13 21:11:46,928 INFO  ipc.Server - Handler waiting a call
2007-02-13 21:11:46,928 INFO  ipc.Server - notify dequeue WAKE UP LISTENER!!
2007-02-13 21:11:46,928 INFO  ipc.Server - notify dequeue WAKE UP LISTENER!!
2007-02-13 21:11:46,928 INFO  ipc.RPC - Call:
getSummary(20060925130810/http://andaluciajunta.e...          (line 20)
2007-02-13 21:11:46,931 INFO  ipc.RPC - Return:  ... encomienden el
Presidente o la Junta de Go...
2007-02-13 21:11:46,932 INFO  ipc.Server - Handler waiting a call
2007-02-13 21:11:46,932 INFO  ipc.Server - notify dequeue WAKE UP LISTENER!!
2007-02-13 21:11:46,932 INFO  ipc.RPC - Call:
getSummary(20060925130810/http://andaluciajunta.e...
2007-02-13 21:11:46,935 INFO  ipc.RPC - Return:  ... composición y
convocatoria de la Junta Gen...                   (line 25)
2007-02-13 21:11:46,935 INFO  ipc.Server - Handler waiting a call
2007-02-13 21:11:46,935 INFO  ipc.Server - notify dequeue WAKE UP LISTENER!!
2007-02-13 21:11:46,935 INFO  ipc.RPC - Call:
getSummary(20060925130719/http://andaluciajunta.e...
2007-02-13 21:11:46,935 INFO  ipc.RPC - Return:  ... Boletín Oficial de la
Junta de Andalucía. ...
2007-02-13 21:11:46,935 INFO  ipc.Server - Handler waiting a
call
(line 30)
2007-02-13 21:11:46,935 INFO  ipc.Server - server is stopped for 5 seconds
2007-02-13 21:11:46,935 INFO  ipc.RPC - Call:
getSummary(20060925130810/http://andaluciajunta.e...
2007-02-13 21:11:46,939 INFO  ipc.RPC - Return:  ... electos que formen la
nueva Junta de Gobie...
2007-02-13 21:11:46,939 INFO  ipc.Server - Handler waiting a call
2007-02-13 21:11:51,936 INFO  ipc.Server - queued a new Call
org.apache.hadoop.ipc.Server$Call@121ef78      (line 35)

Observe that it tooks 5 seconds, the Listener timeout i wrote for the load
test, from the last but one to the last log line.

In the first line the queue is full of calls so it is satisfied the
condition to stop the Listener, remember this piece of code from Connection:

while (running && callQueue.size() >= maxQueuedCalls) {
          LOG.info("queued is full. The server shoul wait.
"+callQueue.size()+" calls left");
          synchronized (callDequeued) {         // queue is full1
          LOG.info(" server is stopped for 5 seconds");
          callDequeued.wait(5000);         // wait for a dequeue for 60
seconds
        }
}

But before server starts waiting all the calls are processed (from line 2 to
line 31).  Then, when Listener executes "callDequeued.wait(5000)" none of
the handlers should have locked callDequeued, so why it waits instead of
keep on running?




2007/2/5, Alvaro Cabrerizo <toporniz@gmail.com>:
>
> Hi,
>
> I've detected that kernel version on machines were 2.4 instead of 2.6. So
> during test It was used linuxthreads-0.10. After reading
> http://www.unixville.com/~moazam/stories/2004/05/18/debuggingHangsInTheJvm.html<http://www.unixville.com/%7Emoazam/stories/2004/05/18/debuggingHangsInTheJvm.html>,
> I`ve decided to move to kernel 2.6 in order to install NPTL (Native Posix
> Thread Library). So it would be very helpful to know which kernel version
> and which library to manage threads was used when the bug was closed.
>
> Thanks in  advance.
>
>
>
> 2007/1/31, Alvaro Cabrerizo <toporniz@gmail.com>:
> >
> > Hi:
> >
> > During a load test using Jmeter I get this error:
> >
> > 2007-01-31 09:27:47 StandardWrapperValve[jsp]: Servlet.service() para
> > servlet jsp lanzó excepción
> > java.lang.NullPointerException
> >         at org.apache.nutch.searcher.DistributedSearch$Client.getSummary(
> > DistributedSearch.java:323)
> >         at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java
> > :349)
> >         at org.apache.jsp.search_jsp._jspService(search_jsp.java:267)
> >         at org.apache.jasper.runtime.HttpJspBase.service (
> > HttpJspBase.java:94)
> >         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> >         at org.apache.jasper.servlet.JspServletWrapper.service(
> > JspServletWrapper.java:324)
> >         at org.apache.jasper.servlet.JspServlet.serviceJspFile (
> > JspServlet.java:292)
> >         at org.apache.jasper.servlet.JspServlet.service(JspServlet.java
> > :236)
> > ...
> > 007-01-31 09:27:47 StandardWrapperValve[jsp]: Servlet.service() para
> > servlet jsp lanzó excepción
> > java.io.IOException : no responses
> >         at org.apache.hadoop.ipc.Client.call(Client.java:344)
> >         at org.apache.hadoop.ipc.RPC.call(RPC.java:174)
> >         at org.apache.nutch.searcher.DistributedSearch$Client.getDetails
> > (DistributedSearch.java :307)
> >         at org.apache.nutch.searcher.NutchBean.getDetails(NutchBean.java
> > :338)
> >         at org.apache.jsp.search_jsp._jspService(search_jsp.java:266)
> >
> > The test was done using three machines, in the same lan segment, with
> > this configuration:
> >
> >  -Pentium 4 3.0GHz
> >  -1GB RAM
> >  -Jdk 1.5
> >  -Tomcat 5
> >  -linux 2.6 kernel.
> >  -nutch-0.8
> >  -hadoop-0.4.0
> >
> > The first machine runs a "search-server" that loads a Nutch index of
> > 20000 documents, about 20MB of disk space for the Lucene index. The second
> > one runs a tomcat that communicates with the "search-server" and the third
> > one runs jmeter.
> >
> > The request that launches the exception takes about 60 seconds, while
> > rest take about 650ms. Sysstat indicates that first and second machine have
> > the CPU idle, close to 100%, no swapping and still lot of free memory.
> >
> > As I can see that is related with the closed bug nutch-14 (http://issues.apache.org/jira/browse/NUTCH-14
> > ).
> >
> > Any help will be apreciated in order to solve this problem.
> >
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message