Return-Path: Delivered-To: apmail-lucene-hadoop-user-archive@locus.apache.org Received: (qmail 80025 invoked from network); 14 Feb 2007 17:03:08 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 14 Feb 2007 17:03:08 -0000 Received: (qmail 39833 invoked by uid 500); 14 Feb 2007 17:03:14 -0000 Delivered-To: apmail-lucene-hadoop-user-archive@lucene.apache.org Received: (qmail 39811 invoked by uid 500); 14 Feb 2007 17:03:14 -0000 Mailing-List: contact hadoop-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hadoop-user@lucene.apache.org Delivered-To: mailing list hadoop-user@lucene.apache.org Received: (qmail 39802 invoked by uid 99); 14 Feb 2007 17:03:14 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 14 Feb 2007 09:03:14 -0800 X-ASF-Spam-Status: No, hits=2.8 required=10.0 tests=HTML_MESSAGE,INFO_TLD,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (herse.apache.org: domain of toporniz@gmail.com designates 66.249.92.174 as permitted sender) Received: from [66.249.92.174] (HELO ug-out-1314.google.com) (66.249.92.174) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 14 Feb 2007 09:03:03 -0800 Received: by ug-out-1314.google.com with SMTP id k40so269378ugc for ; Wed, 14 Feb 2007 09:02:41 -0800 (PST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=beta; h=received:message-id:date:from:to:subject:mime-version:content-type; b=akR7W0WBt9ac55nTK40OvMYSwai67yLoSYhwBiefoJxaHsDFbjg6KHt1oAKHMpWsikWtL6B0hE1XjQqoFgMr4AQTOci+w5LZCjdkOU3+9aAJ2FaoYVP0ayENiLcWpPu81Chs6WwaOkJBB6cvNOzo0Xajawm3h6hfQDSZ10fdYXA= Received: by 10.114.124.1 with SMTP id w1mr395841wac.1171472559938; Wed, 14 Feb 2007 09:02:39 -0800 (PST) Received: by 10.114.89.7 with HTTP; Wed, 14 Feb 2007 09:02:39 -0800 (PST) Message-ID: <98d722580702140902g2e8707f9o9af6e52cf89050b5@mail.gmail.com> Date: Wed, 14 Feb 2007 18:02:39 +0100 From: "Alvaro Cabrerizo" To: hadoop-user@lucene.apache.org Subject: Re: is it closed nutch-14 bug? (minimized efects) MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_76413_20048286.1171472559741" X-Virus-Checked: Checked by ClamAV on apache.org ------=_Part_76413_20048286.1171472559741 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Hi: Although I havent found the solution to the problem, at least I could minimize the effects. First of all I'll try to explain what I think is happening here. Lets have a look a thread dump when system goes down: Full thread dump Java HotSpot(TM) Server VM (1.5.0_07-b03 mixed mode): "Server handler 4 on 55555" daemon prio=3D1 tid=3D0x08464710 nid=3D0x1317 i= n Object.wait() [0x7a88e000..0x7a88f040] at java.lang.Object.wait(Native Method) - waiting on <0x80cefc90> (a java.util.LinkedList) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471) - locked <0x80cefc90> (a java.util.LinkedList) "Server handler 3 on 55555" daemon prio=3D1 tid=3D0x085cd550 nid=3D0x1316 i= n Object.wait() [0x7a90f000..0x7a90f1c0] at java.lang.Object.wait(Native Method) - waiting on <0x80cefc90> (a java.util.LinkedList) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471) - locked <0x80cefc90> (a java.util.LinkedList) "Server handler 2 on 55555" daemon prio=3D1 tid=3D0x0867ea70 nid=3D0x1315 i= n Object.wait() [0x7a98e000..0x7a98f140] at java.lang.Object.wait(Native Method) - waiting on <0x80cefc90> (a java.util.LinkedList) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471) - locked <0x80cefc90> (a java.util.LinkedList) "Server handler 1 on 55555" daemon prio=3D1 tid=3D0x0867e440 nid=3D0x1314 i= n Object.wait() [0x7aa0e000..0x7aa0eec0] at java.lang.Object.wait(Native Method) - waiting on <0x80cefc90> (a java.util.LinkedList) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471) - locked <0x80cefc90> (a java.util.LinkedList) "Server handler 0 on 55555" daemon prio=3D1 tid=3D0x0867ec38 nid=3D0x1313 i= n Object.wait() [0x7aa8e000..0x7aa8ee40] at java.lang.Object.wait(Native Method) - waiting on <0x80cefc90> (a java.util.LinkedList) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:471) - locked <0x80cefc90> (a java.util.LinkedList) "Server listener on port 55555" daemon prio=3D1 tid=3D0x086dafd8 nid=3D0x13= 12 in Object.wait() [0x7ab0e000..0x7ab0efc0] at java.lang.Object.wait(Native Method) - waiting on <0x80d15e40> (a java.lang.Object) at org.apache.hadoop.ipc.Server$Connection.processData(Server.java:433) - locked <0x80d15e40> (a java.lang.Object) at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java :402) at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:300) at org.apache.hadoop.ipc.Server$Listener.run(Server.java:213) "Low Memory Detector" daemon prio=3D1 tid=3D0x080f6c88 nid=3D0x1310 runnabl= e [0x00000000..0x00000000] "CompilerThread1" daemon prio=3D1 tid=3D0x080f5800 nid=3D0x130f waiting on condition [0x00000000..0x7ccd5268] "CompilerThread0" daemon prio=3D1 tid=3D0x080f47a0 nid=3D0x130e waiting on condition [0x00000000..0x7cd553e8] "AdapterThread" daemon prio=3D1 tid=3D0x080f3660 nid=3D0x130d waiting on co= ndition [0x00000000..0x00000000] "Signal Dispatcher" daemon prio=3D1 tid=3D0x080f2780 nid=3D0x130c waiting o= n condition [0x00000000..0x00000000] "Finalizer" daemon prio=3D1 tid=3D0x080e7fb0 nid=3D0x130b in Object.wait() [0x7d058000..0x7d058e40] at java.lang.Object.wait(Native Method) - waiting on <0x80c603c8> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116) - locked <0x80c603c8> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) "Reference Handler" daemon prio=3D1 tid=3D0x080e7af8 nid=3D0x130a in Object= .wait() [0x7d0d8000..0x7d0d8fc0] at java.lang.Object.wait(Native Method) - waiting on <0x80c600b0> (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:474) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) - locked <0x80c600b0> (a java.lang.ref.Reference$Lock) "main" prio=3D1 tid=3D0x0805e250 nid=3D0x1305 in Object.wait() [0xaffab000..0xaffac158] at java.lang.Object.wait(Native Method) - waiting on <0x80cefbe0> (a org.apache.hadoop.ipc.RPC$Server) at java.lang.Object.wait(Object.java:474) at org.apache.hadoop.ipc.Server.join(Server.java:587) - locked <0x80cefbe0> (a org.apache.hadoop.ipc.RPC$Server) at org.apache.nutch.searcher.DistributedSearch$Server.main( DistributedSearch.java:75) "VM Thread" prio=3D1 tid=3D0x080e5668 nid=3D0x1309 runnable "VM Periodic Task Thread" prio=3D1 tid=3D0x080f8128 nid=3D0x1311 waiting on condition We can see that all of the handlers (I launch the server using 5 handlers) are waiting for new queued calls, we can watch it in the next piece of cod= e from the run method that belongs to class Handler: while (running && callQueue.size()=3D=3D0) { // wait for a call callQueue.wait(timeout); //line 471 } In the other hand the Listener is waiting because it thinks queue is full. This is shown in the next piece of code from the processData method that belongs to class Connection: while (running && callQueue.size() >=3D maxQueuedCalls) { synchronized (callDequeued) { // queue is full1 callDequeued.wait(timeout); // wait for a dequeue for 60 seconds (line 433) } } So Listener thinks queue is full and Handlers that it is empty, somebody must be wrong. As we can see in the next snippet, from the method run of th= e class Handler, every Handler notify a dequeued call, and this should be enough to wake up the Listener: while (running) { try { Call call; synchronized (callQueue) { while (running && callQueue.size()=3D=3D0) { // wait for a call callQueue.wait(timeout); } if (!running) break; call =3D (Call)callQueue.removeFirst(); // pop the queue } synchronized (callDequeued) { // tell others we've dequeued callDequeued.notify(); // <-this should wake up Listener but sometimes dont } So, in order to minimize the problem of having the Listener stopped for 60 seconds I've reduced its waiting time to 500ms. I've also added some traces generated during a load test. 2007-02-13 21:11:46,910 INFO ipc.Server - ----queue is full. The server should wait. 5 calls left (line 1) 2007-02-13 21:11:46,911 INFO ipc.RPC - Return: ... fecha de 30 de junio d= e 2005 por la que se... 2007-02-13 21:11:46,911 INFO ipc.Server - dequeued a call, 4 calls left 2007-02-13 21:11:46,914 INFO ipc.RPC - Return: ... E, e incluida en la Oferta de Empleo P=FAbli... 2007-02-13 21:11:46,915 INFO ipc.Server - dequeued a call, 3 calls left (line 5) 2007-02-13 21:11:46,922 INFO ipc.RPC - Return: ... PROMOCION INTERNA UN OFICIAL CELADOR OFERT... 2007-02-13 21:11:46,923 INFO ipc.Server - dequeued a call, 2 calls left 2007-02-13 21:11:46,923 INFO ipc.RPC - Return: org.apache.nutch.searcher.Hits@1782e36 2007-02-13 21:11:46,923 INFO ipc.Server - dequeued a call, 1 calls left 2007-02-13 21:11:46,923 INFO ipc.Server - notify dequeue WAKE UP LISTENER!! (line 10) 2007-02-13 21:11:46,923 INFO ipc.RPC - Call: getDetails(#14599) 2007-02-13 21:11:46,923 INFO ipc.RPC - Return: 20060925130810/http://andaluciajunta.es/portal/... 2007-02-13 21:11:46,924 INFO ipc.Server - dequeued a call, 0 calls left 2007-02-13 21:11:46,924 INFO ipc.Server - notify dequeue WAKE UP LISTENER!= ! 2007-02-13 21:11:46,924 INFO ipc.RPC - Call: getSummary(20060925130810/http://andaluciajunta.e... (line 15) 2007-02-13 21:11:46,928 INFO ipc.RPC - Return: ... parte de ellos. 2. La Junta se entender=E1 v... 2007-02-13 21:11:46,928 INFO ipc.Server - Handler waiting a call 2007-02-13 21:11:46,928 INFO ipc.Server - notify dequeue WAKE UP LISTENER!= ! 2007-02-13 21:11:46,928 INFO ipc.Server - notify dequeue WAKE UP LISTENER!= ! 2007-02-13 21:11:46,928 INFO ipc.RPC - Call: getSummary(20060925130810/http://andaluciajunta.e... (line 20) 2007-02-13 21:11:46,931 INFO ipc.RPC - Return: ... encomienden el Presidente o la Junta de Go... 2007-02-13 21:11:46,932 INFO ipc.Server - Handler waiting a call 2007-02-13 21:11:46,932 INFO ipc.Server - notify dequeue WAKE UP LISTENER!= ! 2007-02-13 21:11:46,932 INFO ipc.RPC - Call: getSummary(20060925130810/http://andaluciajunta.e... 2007-02-13 21:11:46,935 INFO ipc.RPC - Return: ... composici=F3n y convocatoria de la Junta Gen... (line 25) 2007-02-13 21:11:46,935 INFO ipc.Server - Handler waiting a call 2007-02-13 21:11:46,935 INFO ipc.Server - notify dequeue WAKE UP LISTENER!= ! 2007-02-13 21:11:46,935 INFO ipc.RPC - Call: getSummary(20060925130719/http://andaluciajunta.e... 2007-02-13 21:11:46,935 INFO ipc.RPC - Return: ... Bolet=EDn Oficial de l= a Junta de Andaluc=EDa. ... 2007-02-13 21:11:46,935 INFO ipc.Server - Handler waiting a call (line 30) 2007-02-13 21:11:46,935 INFO ipc.Server - server is stopped for 5 seconds 2007-02-13 21:11:46,935 INFO ipc.RPC - Call: getSummary(20060925130810/http://andaluciajunta.e... 2007-02-13 21:11:46,939 INFO ipc.RPC - Return: ... electos que formen la nueva Junta de Gobie... 2007-02-13 21:11:46,939 INFO ipc.Server - Handler waiting a call 2007-02-13 21:11:51,936 INFO ipc.Server - queued a new Call org.apache.hadoop.ipc.Server$Call@121ef78 (line 35) Observe that it tooks 5 seconds, the Listener timeout i wrote for the load test, from the last but one to the last log line. In the first line the queue is full of calls so it is satisfied the condition to stop the Listener, remember this piece of code from Connection= : while (running && callQueue.size() >=3D maxQueuedCalls) { LOG.info("queued is full. The server shoul wait. "+callQueue.size()+" calls left"); synchronized (callDequeued) { // queue is full1 LOG.info(" server is stopped for 5 seconds"); callDequeued.wait(5000); // wait for a dequeue for 60 seconds } } But before server starts waiting all the calls are processed (from line 2 t= o line 31). Then, when Listener executes "callDequeued.wait(5000)" none of the handlers should have locked callDequeued, so why it waits instead of keep on running? 2007/2/5, Alvaro Cabrerizo : > > Hi, > > I've detected that kernel version on machines were 2.4 instead of 2.6. So > during test It was used linuxthreads-0.10. After reading > http://www.unixville.com/~moazam/stories/2004/05/18/debuggingHangsInTheJv= m.html, > I`ve decided to move to kernel 2.6 in order to install NPTL (Native Posix > Thread Library). So it would be very helpful to know which kernel version > and which library to manage threads was used when the bug was closed. > > Thanks in advance. > > > > 2007/1/31, Alvaro Cabrerizo : > > > > Hi: > > > > During a load test using Jmeter I get this error: > > > > 2007-01-31 09:27:47 StandardWrapperValve[jsp]: Servlet.service() para > > servlet jsp lanz=F3 excepci=F3n > > java.lang.NullPointerException > > at org.apache.nutch.searcher.DistributedSearch$Client.getSummar= y( > > DistributedSearch.java:323) > > at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.jav= a > > :349) > > at org.apache.jsp.search_jsp._jspService(search_jsp.java:267) > > at org.apache.jasper.runtime.HttpJspBase.service ( > > HttpJspBase.java:94) > > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > > at org.apache.jasper.servlet.JspServletWrapper.service( > > JspServletWrapper.java:324) > > at org.apache.jasper.servlet.JspServlet.serviceJspFile ( > > JspServlet.java:292) > > at org.apache.jasper.servlet.JspServlet.service(JspServlet.java > > :236) > > ... > > 007-01-31 09:27:47 StandardWrapperValve[jsp]: Servlet.service() para > > servlet jsp lanz=F3 excepci=F3n > > java.io.IOException : no responses > > at org.apache.hadoop.ipc.Client.call(Client.java:344) > > at org.apache.hadoop.ipc.RPC.call(RPC.java:174) > > at org.apache.nutch.searcher.DistributedSearch$Client.getDetail= s > > (DistributedSearch.java :307) > > at org.apache.nutch.searcher.NutchBean.getDetails(NutchBean.jav= a > > :338) > > at org.apache.jsp.search_jsp._jspService(search_jsp.java:266) > > > > The test was done using three machines, in the same lan segment, with > > this configuration: > > > > -Pentium 4 3.0GHz > > -1GB RAM > > -Jdk 1.5 > > -Tomcat 5 > > -linux 2.6 kernel. > > -nutch-0.8 > > -hadoop-0.4.0 > > > > The first machine runs a "search-server" that loads a Nutch index of > > 20000 documents, about 20MB of disk space for the Lucene index. The sec= ond > > one runs a tomcat that communicates with the "search-server" and the th= ird > > one runs jmeter. > > > > The request that launches the exception takes about 60 seconds, while > > rest take about 650ms. Sysstat indicates that first and second machine = have > > the CPU idle, close to 100%, no swapping and still lot of free memory. > > > > As I can see that is related with the closed bug nutch-14 (http://issue= s.apache.org/jira/browse/NUTCH-14 > > ). > > > > Any help will be apreciated in order to solve this problem. > > > > ------=_Part_76413_20048286.1171472559741--