manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawling causes Socket Timeout after Database Exception
Date Thu, 18 Oct 2012 10:33:03 GMT
So, what was the resolution of this problem?  Any news?
Karl

On Thu, Oct 11, 2012 at 2:28 AM, Karl Wright <daddywri@gmail.com> wrote:
> The only change is that the MySQL driver now performs ANALYZE
> operations on the fly in order to keep the database operating at high
> efficiency.  This is CONNECTORS-510.  It is possible that, on a large
> database table, these operations will cause others to wait long enough
> so that their timeout is exceeded.  Such an event does not take place
> while the load tests run, however.  If you want to turn off the
> analyze operation, you can do that by setting a per-table property to
> override the analyze default of 10000 operations:
>
> analyzeThreshold =
> ManifoldCF.getIntProperty("org.apache.manifold.db.mysql.analyze."+tableName,10000);
>
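As a rough illustration of how a per-table threshold like this could gate an on-the-fly ANALYZE, here is a minimal sketch. It is not ManifoldCF's implementation: the `AnalyzeTracker` class and its methods are hypothetical, `java.util.Properties` stands in for `ManifoldCF.getIntProperty`, and the property prefix is copied verbatim from the line above.

```java
import java.util.Properties;

// Hypothetical helper, not part of ManifoldCF: counts table modifications and
// signals when an ANALYZE would be due under a per-table threshold.
final class AnalyzeTracker {
    // Property prefix copied verbatim from the message above.
    static final String PREFIX = "org.apache.manifold.db.mysql.analyze.";
    // Default threshold stated in the message above.
    static final int DEFAULT_THRESHOLD = 10000;

    private final int threshold;
    private long modifications = 0;

    // Properties is an assumed stand-in for ManifoldCF's own property lookup.
    AnalyzeTracker(Properties props, String tableName) {
        String raw = props.getProperty(PREFIX + tableName);
        this.threshold = (raw == null) ? DEFAULT_THRESHOLD : Integer.parseInt(raw.trim());
    }

    // Records one modification; returns true when the threshold is reached,
    // then resets the counter so the next window starts from zero.
    synchronized boolean noteModification() {
        modifications++;
        if (modifications >= threshold) {
            modifications = 0;
            return true;
        }
        return false;
    }

    int threshold() {
        return threshold;
    }
}
```

With the property for "jobqueue" set to a very large value such as 1000000000, `noteModification()` would effectively never return true, which is the "turn it off" effect described here.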
> The table in question is "jobqueue".  If you set this value to
> something like 1000000000 and you still see MySQL timeouts, then this
> new code is not the problem.  And, like I said, the best solution is
> to recognize the error and retry, but first I would need the error
> code.  Adding an appropriate output of sqlState around line 123 of
> framework/core/src/main/java/org/apache/manifoldcf/core/database/DBInterfaceMySQL.java
> would allow us to see which code to catch when it happens again.
>
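The kind of diagnostic output asked for above could look roughly like the following sketch. The `SqlErrorInfo` class is hypothetical and not part of ManifoldCF; it only uses the standard `java.sql.SQLException` accessors. The vendor code 1205 is MySQL's server error number for "Lock wait timeout exceeded"; the SQLState the driver reports is deliberately not assumed, since that is exactly what the logging is meant to reveal.

```java
import java.sql.SQLException;

// Hypothetical helper, not part of ManifoldCF.
final class SqlErrorInfo {
    // MySQL server error number ER_LOCK_WAIT_TIMEOUT.
    static final int MYSQL_LOCK_WAIT_TIMEOUT = 1205;

    private SqlErrorInfo() {
    }

    // Builds a one-line description with the fields needed to decide which code to catch.
    static String describe(SQLException e) {
        return "SQLState=" + e.getSQLState()
            + " vendorCode=" + e.getErrorCode()
            + " message=" + e.getMessage();
    }

    // True when the vendor-specific error code matches MySQL's lock wait timeout.
    static boolean isLockWaitTimeout(SQLException e) {
        return e.getErrorCode() == MYSQL_LOCK_WAIT_TIMEOUT;
    }
}
```

Writing `SqlErrorInfo.describe(e)` to the log wherever the `SQLException` is caught would capture both the SQLState and the vendor code the next time the timeout occurs.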
> For the Web connector, the only modifications have been in regard to
> how it handles 500 errors, which are now coded correctly to avoid an
> IndexOutOfBoundsException.  This has nothing to do with
> socket exceptions, which have external causes only.
>
> Karl
>
>
> On Wed, Oct 10, 2012 at 10:32 PM, Shigeki Kobayashi
> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> Hi Karl,
>>
>>
>> I was comparing version 1.0 with an old trunk build based on version 0.6
>> that implements CONNECTORS-501 (Medium-scale web crawl with hopcount-based
>> filtering fails to find correct number of documents).
>>
>> Running each version with the same MySQL settings and the same throttling,
>> version 1.0 somehow hangs with the error.
>> Since the old trunk completes crawling, I wonder if something has changed.
>>
>> Just to make sure, I will recheck whether any MCF settings are wrong.
>>
>> Thanks.
>>
>> Regards,
>>
>> Shigeki
>>
>> 2012/10/10 Karl Wright <daddywri@gmail.com>
>>>
>>> Hi Shigeki,
>>>
>>> The socket timeout exception is only a warning.  It means that some
>>> site you are crawling did not accept a socket connection within the
>>> allowed time (5 minutes I think).  The Web Connector will retry the
>>> connection a few times, and if it is still rejected, it will
>>> eventually give up on that page.  One thing you want to check, though,
>>> is that you are using proper throttling, because if you aren't then
>>> one cause of this problem is that the webmaster of the site you are
>>> trying to crawl may have blocked you from accessing it.
>>>
>>> The database exception is more problematic.  It means that MySQL
>>> thinks it took too long for a specific transaction to complete, and
>>> the database aborted the transaction due to a timeout.  There are two
>>> ways of dealing with this issue.  One way is to modify your MySQL
>>> configuration to increase the transaction timeout value to some high
>>> number.  The second way is to modify ManifoldCF to recognize the
>>> timeout error specifically, and cause a retry.  But in order to do the
>>> latter, I would need to know what SQL error code MySQL returns for
>>> this situation, which will mean we either need to look it up (if we
>>> can), or modify a ManifoldCF instance to log it when this problem
>>> occurs.
>>>
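The second approach (recognize the timeout and retry) can be sketched as below. This is an illustration under stated assumptions, not ManifoldCF code: `RetryingTransaction`, `SqlWork`, and `runWithRetry` are hypothetical names, and the timeout is recognized by MySQL's vendor error number 1205 rather than by SQLState, which is still unknown at this point in the thread.

```java
import java.sql.SQLException;

// Hypothetical helper, not part of ManifoldCF.
final class RetryingTransaction {
    // One attempt at a unit of database work.
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    // MySQL server error number ER_LOCK_WAIT_TIMEOUT.
    static final int MYSQL_LOCK_WAIT_TIMEOUT = 1205;

    private RetryingTransaction() {
    }

    // Runs the work, retrying only on lock wait timeouts, with a linearly growing pause.
    // Any other SQLException, or a timeout on the last attempt, is rethrown unchanged.
    static <T> T runWithRetry(SqlWork<T> work, int maxAttempts, long backoffMillis)
            throws SQLException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int attempt = 1;
        while (true) {
            try {
                return work.run();
            } catch (SQLException e) {
                boolean retryable = e.getErrorCode() == MYSQL_LOCK_WAIT_TIMEOUT;
                if (!retryable || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoffMillis * attempt);
                attempt++;
            }
        }
    }
}
```

The caller is assumed to roll back and restart the whole transaction inside `work`, since MySQL's message itself says "try restarting transaction".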
>>> Please let me know how you would like to proceed.
>>>
>>> Karl
>>>
>>> On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi
>>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>>> >
>>> > Hi
>>> >
>>> > I am having trouble crawling the web using MCF 1.0.
>>> > I run MCF with MySQL 5.5 and Tomcat 6.0.
>>> > It should keep crawling content, but MCF prints the following database
>>> > exception log, then hangs.
>>> > After the database exception, a socket timeout exception occurs.
>>> >
>>> > Has anyone faced this problem?
>>> >
>>> > --Database Exception log:
>>> >
>>> > ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread
>>> > aborting
>>> > and restarting due to database connection reset: Database exception:
>>> > Exception doing query: Lock wait timeout exceeded; try restarting
>>> > transaction
>>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
>>> > exception: Exception doing query: Lock wait timeout exceeded; try
>>> > restarting
>>> > transaction
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:6049)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessAcivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:6159)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:52)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:225)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:7047)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:6011)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1282)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>>> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try
>>> > restarting
>>> > transaction
>>> >         at
>>> > com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
>>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
>>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
>>> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
>>> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
>>> >         at
>>> > com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
>>> >         at
>>> >
>>> > com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
>>> >         at
>>> >
>>> > com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
>>> >         at
>>> > org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
>>> > ERROR 2012-10-10 16:11:06,799 (Worker thread '9') - Worker thread
>>> > aborting
>>> > and restarting due to database connection reset: Database exception:
>>> > Exception doing query: Lock wait timeout exceeded; try restarting
>>> > transaction
>>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
>>> > exception: Exception doing query: Lock wait timeout exceeded; try
>>> > restarting
>>> > transaction
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554)
>>> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try
>>> > restarting
>>> > transaction
>>> >         at
>>> > com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
>>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
>>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
>>> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
>>> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
>>> >         at
>>> > com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
>>> >         at
>>> >
>>> > com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
>>> >         at
>>> >
>>> > com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
>>> >         at
>>> > org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
>>> >
>>> >
>>> >
>>> > ---- Socket Timeout:
>>> >
>>> >
>>> > DEBUG 2012-10-10 16:16:27,256 (Worker thread '49') - Socket timeout
>>> > exception trying to close connection: Read timed out
>>> > java.net.SocketTimeoutException: Read timed out
>>> >         at java.net.SocketInputStream.socketRead0(Native Method)
>>> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>>> >         at
>>> > java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>> >         at
>>> > java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>> >         at
>>> > java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>> >         at
>>> > org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown
>>> > Source)
>>> >         at
>>> > org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown
>>> > Source)
>>> >         at
>>> >
>>> > org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown
>>> > Source)
>>> >         at
>>> > org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown
>>> > Source)
>>> >         at java.io.FilterInputStream.close(FilterInputStream.java:155)
>>> >         at
>>> > org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown
>>> > Source)
>>> >         at
>>> > org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown
>>> > Source)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.close(ThrottledFetcher.java:2082)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:176)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
>>> >  INFO 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: FETCH
>>> >
>>> > URL|http://xxxxxx/...|1349852786744+600514|-104|4125|org.apache.manifoldcf.core.interfaces.ManifoldCFException|
>>> > Interrupted: Socket timeout: Read timed out
>>> > DEBUG 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: Fetch
>>> > exception
>>> > for 'http://xxxxxx/...'
>>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted:
>>> > Socket timeout: Read timed out
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1818)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:797)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
>>> > Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption:
>>> > Socket timeout: Read timed out
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:101)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
>>> >         ... 1 more
>>> > Caused by: java.net.SocketTimeoutException: Read timed out
>>> >         at java.net.SocketInputStream.socketRead0(Native Method)
>>> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>>> >         at
>>> > java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>> >         at
>>> > java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>> >         at
>>> > org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown
>>> > Source)
>>> >         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>>> >         at
>>> > org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown
>>> > Source)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2012)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1976)
>>> >         at
>>> >
>>> > org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
>>> >         ... 2 more
>>> >  WARN 2012-10-10 16:16:27,274 (Worker thread '49') - Pre-ingest service
>>> > interruption reported for job 1349774325961 connection 'WEB': Socket
>>> > timeout: Read timed out
>>> >
>>> >
>>> >
>>> > Regards,
>>> >
>>> > Shigeki
>>
>>
>>
>>
