manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawling causes Socket Timeout after Database Exception
Date Wed, 10 Oct 2012 08:05:27 GMT
Hi Shigeki,

The socket timeout exception is only a warning.  It means that some
site you are crawling did not respond on its socket within the
allowed time (5 minutes I think).  The Web Connector will retry the
connection a few times, and if it still fails, it will eventually
give up on that page.  One thing you should check, though, is that
you are using proper throttling.  If you aren't, one possible cause
of this problem is that the webmaster of the site you are trying to
crawl has blocked you from accessing it.

The database exception is more problematic.  It means that a MySQL
transaction waited too long to acquire a lock held by another
transaction, and the database gave up on the wait with a timeout.
There are two ways of dealing with this issue.  One way is to modify
your MySQL configuration to increase the lock wait timeout (the
innodb_lock_wait_timeout setting for InnoDB tables) to some high
number.  The second way is to modify ManifoldCF to recognize the
timeout error specifically, and retry.  But in order to do the
latter, I would need to know what SQL error code MySQL returns for
this situation, which means we either need to look it up (if we
can), or modify a ManifoldCF instance to log it when this problem
occurs.
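
To make the second option concrete, here is a minimal sketch of what
"recognize the timeout and retry" could look like at the JDBC level.
It is not ManifoldCF code: the class LockWaitRetry, the SqlAction
interface, and the withRetry helper are hypothetical names made up
for illustration.  It assumes the vendor code is 1205, which is what
the MySQL error reference lists for ER_LOCK_WAIT_TIMEOUT ("Lock wait
timeout exceeded; try restarting transaction"); that should still be
confirmed by logging SQLException.getErrorCode() on a real instance.

```java
import java.sql.SQLException;

// Hypothetical helper, not part of ManifoldCF: retries a database action
// when MySQL reports a lock wait timeout.
final class LockWaitRetry {
    // Assumption: MySQL's documented vendor code for ER_LOCK_WAIT_TIMEOUT.
    static final int ER_LOCK_WAIT_TIMEOUT = 1205;

    // A database action that may fail with an SQLException.
    interface SqlAction<T> {
        T run() throws SQLException;
    }

    // True if the exception carries the lock wait timeout vendor code.
    static boolean isLockWaitTimeout(SQLException e) {
        return e.getErrorCode() == ER_LOCK_WAIT_TIMEOUT;
    }

    // Runs the action, retrying up to maxAttempts times on lock wait
    // timeouts only. Any other SQLException is rethrown immediately.
    static <T> T withRetry(SqlAction<T> action, int maxAttempts, long sleepMillis)
            throws SQLException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.run();
            } catch (SQLException e) {
                if (!isLockWaitTimeout(e) || attempt >= maxAttempts) {
                    throw e;
                }
                // Back off briefly so the competing transaction can finish.
                Thread.sleep(sleepMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated query: fails twice with the timeout code, then succeeds.
        String result = withRetry(() -> {
            if (calls[0]++ < 2) {
                throw new SQLException("Lock wait timeout exceeded", "HY000", 1205);
            }
            return "done";
        }, 5, 0L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

In a real fix the retry would need to restart the whole enclosing
transaction rather than just the failed statement, so treat this only
as an illustration of the error-code check.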

Please let me know how you would like to proceed.

Karl

On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi
<shigeki.kobayashi3@g.softbank.co.jp> wrote:
>
> Hi
>
> I am having trouble crawling the web using MCF 1.0.
> I run MCF with MySQL 5.5 and Tomcat 6.0.
> It should keep crawling content, but MCF prints the following Database
> exception log, then hangs.
> After the DB Exception, a Socket Timeout Exception occurs.
>
> Has anyone faced this problem?
>
> --Database Exception log:
>
> ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread aborting
> and restarting due to database connection reset: Database exception:
> Exception doing query: Lock wait timeout exceeded; try restarting
> transaction
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
> exception: Exception doing query: Lock wait timeout exceeded; try restarting
> transaction
>         at
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
>         at
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
>         at
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
>         at
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
>         at
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
>         at
> org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
>         at
> org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:6049)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessAcivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:6159)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:52)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:225)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:7047)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:6011)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1282)
>         at
> org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
> Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting
> transaction
>         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
>         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
>         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
>         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
>         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
>         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
>         at
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
>         at
> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
>         at
> org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
>         at
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
> ERROR 2012-10-10 16:11:06,799 (Worker thread '9') - Worker thread aborting
> and restarting due to database connection reset: Database exception:
> Exception doing query: Lock wait timeout exceeded; try restarting
> transaction
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
> exception: Exception doing query: Lock wait timeout exceeded; try restarting
> transaction
>         at
> org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
>         at
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
>         at
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
>         at
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
>         at
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
>         at
> org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
>         at
> org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863)
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554)
> Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting
> transaction
>         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
>         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
>         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
>         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
>         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
>         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
>         at
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
>         at
> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
>         at
> org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
>         at
> org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
>
>
>
> ---- Socket Timeout:
>
>
> DEBUG 2012-10-10 16:16:27,256 (Worker thread '49') - Socket timeout
> exception trying to close connection: Read timed out
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>         at
> org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>         at
> org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>         at
> org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown
> Source)
>         at
> org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown Source)
>         at java.io.FilterInputStream.close(FilterInputStream.java:155)
>         at
> org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown
> Source)
>         at org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown
> Source)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.close(ThrottledFetcher.java:2082)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:176)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
>  INFO 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: FETCH
> URL|http://xxxxxx/...|1349852786744+600514|-104|4125|org.apache.manifoldcf.core.interfaces.ManifoldCFException|
> Interrupted: Socket timeout: Read timed out
> DEBUG 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: Fetch exception
> for 'http://xxxxxx/...'
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted:
> Socket timeout: Read timed out
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1818)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:797)
>         at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
> Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption:
> Socket timeout: Read timed out
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:101)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
>         ... 1 more
> Caused by: java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>         at
> org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown
> Source)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2012)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1976)
>         at
> org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
>         ... 2 more
>  WARN 2012-10-10 16:16:27,274 (Worker thread '49') - Pre-ingest service
> interruption reported for job 1349774325961 connection 'WEB': Socket
> timeout: Read timed out
>
>
>
> Regards,
>
> Shigeki
