manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: Web crawling causes Socket Timeout after Database Exception
Date Thu, 11 Oct 2012 02:32:19 GMT
Hi Karl,


I was comparing version 1.0 with an old trunk build, based on version 0.6,
that implements CONNECTORS-501 ("Medium-scale web crawl with hopcount-based
filtering fails to find correct number of documents").

Running each version with the same MySQL settings and the same throttling,
version 1.0 somehow hangs with the error.
Since the old trunk completes the crawl, I wonder whether something has changed.

Just to make sure, I will recheck whether any of the MCF settings are wrong.

Thanks.

Regards,

Shigeki
2012/10/10 Karl Wright <daddywri@gmail.com>

> Hi Shigeki,
>
> The socket timeout exception is only a warning.  It means that some
> site you are crawling did not accept a socket connection within the
> allowed time (5 minutes I think).  The Web Connector will retry the
> connection a few times, and if it is still rejected, it will
> eventually give up on that page.  One thing you want to check, though,
> is that you are using proper throttling, because if you aren't then
> one cause of this problem is that the webmaster of the site you are
> trying to crawl may have blocked you from accessing it.
>
> The database exception is more problematic.  It means that MySQL
> thinks it took too long for a specific transaction to complete, and
> the database aborted the transaction due to a timeout.  There are two
> ways of dealing with this issue.  One way is to modify your MySQL
> configuration to increase the transaction timeout value to some high
> number.  The second way is to modify ManifoldCF to recognize the
> timeout error specifically, and cause a retry.  But in order to do the
> latter, I would need to know what SQL error code MySQL returns for
> this situation, which will mean we either need to look it up (if we
> can), or modify a ManifoldCF instance to log it when this problem
> occurs.
>
> Please let me know how you would like to proceed.
>
> Karl
>
> On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi
> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> >
> > Hi
> >
> > I am having trouble crawling the web with MCF 1.0.
> > I run MCF with MySQL 5.5 and Tomcat 6.0.
> > It should keep crawling content, but MCF logs the following database
> > exception and then hangs.
> > After the database exception, a socket timeout exception occurs.
> >
> > Has anyone faced this problem?
> >
> > --Database Exception log:
> >
> > ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
> >         at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
> >         at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
> >         at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
> >         at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
> >         at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
> >         at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
> >         at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:6049)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessAcivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:6159)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:52)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:225)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:7047)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:6011)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1282)
> >         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
> >         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
> >         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
> >         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
> >         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
> >         at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
> >         at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
> > ERROR 2012-10-10 16:11:06,799 (Worker thread '9') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
> >         at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
> >         at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
> >         at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
> >         at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
> >         at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
> >         at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
> >         at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554)
> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
> >         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
> >         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
> >         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
> >         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
> >         at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
> >         at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
> >
> >
> >
> > ---- Socket Timeout:
> >
> >
> > DEBUG 2012-10-10 16:16:27,256 (Worker thread '49') - Socket timeout exception trying to close connection: Read timed out
> > java.net.SocketTimeoutException: Read timed out
> >         at java.net.SocketInputStream.socketRead0(Native Method)
> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
> >         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> >         at org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown Source)
> >         at org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown Source)
> >         at java.io.FilterInputStream.close(FilterInputStream.java:155)
> >         at org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown Source)
> >         at org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown Source)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.close(ThrottledFetcher.java:2082)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:176)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
> >  INFO 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: FETCH URL| http://xxxxxx/...|1349852786744+600514|-104|4125|org.apache.manifoldcf.core.interfaces.ManifoldCFException| Interrupted: Socket timeout: Read timed out
> > DEBUG 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: Fetch exception for 'http://xxxxxx/...'
> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted: Socket timeout: Read timed out
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1818)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:797)
> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
> > Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption: Socket timeout: Read timed out
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:101)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
> >         ... 1 more
> > Caused by: java.net.SocketTimeoutException: Read timed out
> >         at java.net.SocketInputStream.socketRead0(Native Method)
> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
> >         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> >         at java.io.FilterInputStream.read(FilterInputStream.java:116)
> >         at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2012)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1976)
> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
> >         ... 2 more
> >  WARN 2012-10-10 16:16:27,274 (Worker thread '49') - Pre-ingest service interruption reported for job 1349774325961 connection 'WEB': Socket timeout: Read timed out
> >
> >
> >
> > Regards,
> >
> > Shigeki
>
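
For reference, the second option discussed in the quoted reply (recognizing the lock wait timeout and retrying) might look roughly like the sketch below. It is not ManifoldCF code: the class LockWaitRetry, its method names, the attempt limit, and the backoff are all hypothetical. The one fixed point is that MySQL reports "Lock wait timeout exceeded; try restarting transaction" as server error 1205 (ER_LOCK_WAIT_TIMEOUT), which JDBC exposes through SQLException.getErrorCode(). How long MySQL waits before raising it is set by the InnoDB variable innodb_lock_wait_timeout.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of ManifoldCF: retries a unit of database
// work when MySQL reports a lock wait timeout.
public class LockWaitRetry {
    // MySQL server error code for ER_LOCK_WAIT_TIMEOUT.
    static final int ER_LOCK_WAIT_TIMEOUT = 1205;

    // Walks the cause chain looking for a SQLException with error code 1205.
    static boolean isLockWaitTimeout(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof SQLException
                    && ((SQLException) c).getErrorCode() == ER_LOCK_WAIT_TIMEOUT) {
                return true;
            }
        }
        return false;
    }

    // Runs the work, retrying only on lock wait timeouts, up to maxAttempts
    // total attempts, sleeping a little longer before each retry.
    static <T> T withRetry(Callable<T> work, int maxAttempts, long backoffMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return work.call();
            } catch (Exception e) {
                if (!isLockWaitTimeout(e) || attempt >= maxAttempts) {
                    throw e; // a different error, or out of attempts
                }
                Thread.sleep(backoffMillis * attempt);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated transaction: fails twice with error 1205, then succeeds.
        int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SQLException(
                    "Lock wait timeout exceeded; try restarting transaction",
                    "HY000", ER_LOCK_WAIT_TIMEOUT);
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Because MySQL by default rolls back only the statement that timed out, not the whole transaction, a real implementation would need to roll back and re-run the entire transaction inside the retried block rather than just the failing query.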
