Hi. I am running ManifoldCF 1.4.1 with patch 813, using PostgreSQL 9.3.2 as the database. I am seeing a strange problem with the web crawler: if I run two simultaneous crawls, they hang fairly quickly and the log file shows no activity other than "Idle cleanup thread" messages. If I run a single crawl, however, it runs for days, either finishing or fetching more documents indefinitely.

Usually the two sites I crawl are www.fbi.gov and www.cnn.com. The crawls are vanilla except that I vary the number of connections from 2 to 8 per crawl, and sometimes I select the option to never delete unreachable documents. I have also varied the number of crawler threads from 30 to 60, and I have set the number of database handles to 200. Regardless of these settings, the crawls always hang.

I looked at the threads after the crawls hung, and it looks like some threads are waiting forever for a signal. Most of the crawler threads are waiting for a connector:

      Name: Worker thread '0'
      State: WAITING on org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory$Pool@1932ab0
      Total blocked: 57,189 Total waited: 59,158
      Stack trace: 
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:503)
      org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory$Pool.getConnector(RepositoryConnectorFactory.java:591)
      org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.grab(RepositoryConnectorFactory.java:384)
      org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:254)

However, some threads are waiting on a response from a URL fetch:

      Name: Worker thread '24'
      State: BLOCKED on org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread@6a082f owned by: Thread-3289186
      Total blocked: 60,050 Total waited: 62,264
      Stack trace: 
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:503)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread.getResponseCode(ThrottledFetcher.java:2511)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.executeFetch(ThrottledFetcher.java:1610)
      org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:724)
      org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)

and some threads are waiting on a connection to use to download a URL:

      Name: Worker thread '28'
      State: WAITING on java.lang.Integer@984b34
      Total blocked: 93,074 Total waited: 96,142
      Stack trace: 
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:503)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher.getConnection(ThrottledFetcher.java:413)
      org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:714)
      org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)

while one was waiting to "finish up":

      Name: Worker thread '32'
      State: WAITING on org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread@3232b0
      Total blocked: 71,716 Total waited: 73,738
      Stack trace: 
      java.lang.Object.wait(Native Method)
      java.lang.Thread.join(Thread.java:1260)
      java.lang.Thread.join(Thread.java:1334)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread.finishUp(ThrottledFetcher.java:2629)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.doneFetch(ThrottledFetcher.java:1926)
      org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:804)

It looks like there were also four threads spawned to download the data from the InputStream:

      Name: Thread-3278771
      State: WAITING on org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottleBin@ec30e9
      Total blocked: 0 Total waited: 1
      Stack trace: 
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:503)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottleBin.beginRead(ThrottledFetcher.java:831)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.beginRead(ThrottledFetcher.java:1200)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2133)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:2114)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:2077)
      java.util.zip.CheckedInputStream.read(CheckedInputStream.java:59)
      java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:262)
      java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:254)
      java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:163)
      java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:78)
      java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:90)
      org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread.run(ThrottledFetcher.java:2428)
      locked org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ExecuteMethodThread@144d709


According to jconsole, which I used to get these stack traces, there were no deadlocked threads. However, as the stack traces show, many of these threads are sitting in wait() calls that never return.
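
In case it helps anyone reproduce the check outside jconsole, here is a minimal sketch that uses only the standard java.lang.management API. It is not ManifoldCF code: the class name WaitingThreadReport and its methods are hypothetical names made up for illustration. It also suggests why the deadlock detector can stay silent in a situation like this one: as far as I understand it, the detector only reports cycles of threads waiting to acquire monitors or ownable synchronizers, so a thread parked in Object.wait() for a notify() that never arrives is not counted as deadlocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of ManifoldCF.
class WaitingThreadReport {

    /** Ids of threads the JVM considers deadlocked (empty array if none). */
    static long[] deadlockedThreadIds() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    /** Names of live threads whose top stack frame is inside Object.wait(). */
    static List<String> threadsInObjectWait() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> names = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // thread ended while the dump was taken
            }
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length == 0) {
                continue;
            }
            Thread.State state = info.getThreadState();
            boolean waiting = state == Thread.State.WAITING
                    || state == Thread.State.TIMED_WAITING;
            // The native frame is named wait or wait0 depending on the JDK version.
            boolean inObjectWait = "java.lang.Object".equals(stack[0].getClassName())
                    && stack[0].getMethodName().startsWith("wait");
            if (waiting && inObjectWait) {
                names.add(info.getThreadName());
            }
        }
        return names;
    }

    /** Starts a daemon thread that waits on a lock nobody ever notifies. */
    static Thread startStuckWaiter(String name) {
        final Object lock = new Object();
        Thread t = new Thread(() -> {
            synchronized (lock) {
                try {
                    while (true) {
                        lock.wait(); // no other thread ever calls lock.notify()
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, name);
        t.setDaemon(true);
        t.start();
        // Poll for up to five seconds until the thread is parked in wait().
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (t.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return t;
    }

    public static void main(String[] args) {
        startStuckWaiter("stuck-waiter");
        System.out.println("Deadlocked thread count: " + deadlockedThreadIds().length);
        System.out.println("Threads in Object.wait(): " + threadsInObjectWait());
    }
}
```

In this sketch the "stuck-waiter" thread never makes progress, yet the deadlock check does not flag it. A "no deadlocked threads" result therefore does not rule out threads that are waiting on a signal that is never sent.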

Any help you can offer to keep our web crawls from hanging will be greatly appreciated.

Thank you,
Tom Rees