manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawling causes Socket Timeout after Database Exception
Date Thu, 11 Oct 2012 06:28:23 GMT
The only change is that the MySQL driver now performs ANALYZE
operations on the fly in order to keep the database operating at high
efficiency.  This is CONNECTORS-510.  It is possible that, on a large
database table, these operations will cause others to wait long enough
so that their timeout is exceeded.  Such an event does not take place
while the load tests run, however.  If you want to turn off the
analyze operation, you can do that by setting a per-table property to
override the analyze default of 10000 operations:

analyzeThreshold =
ManifoldCF.getIntProperty("org.apache.manifold.db.mysql.analyze."+tableName,10000);
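
To illustrate how a per-table override of this kind behaves, here is a
minimal sketch.  It does not use ManifoldCF itself: java.util.Properties
stands in for the real configuration, getIntProperty below is a
hypothetical stand-in for ManifoldCF.getIntProperty, and the property
prefix is copied exactly as quoted above.  The table name "jobqueue" is
the one discussed in this thread.

```java
import java.util.Properties;

class AnalyzeThresholdSketch {
    // Property prefix copied as quoted in this message; not verified here.
    static final String PREFIX = "org.apache.manifold.db.mysql.analyze.";
    // Default number of operations between ANALYZE runs, per the message.
    static final int DEFAULT_THRESHOLD = 10000;

    // Hypothetical stand-in for ManifoldCF.getIntProperty(name, default):
    // returns the configured integer, or the default when the key is absent.
    static int getIntProperty(Properties props, String name, int defaultValue) {
        String value = props.getProperty(name);
        if (value == null) {
            return defaultValue;
        }
        return Integer.parseInt(value.trim());
    }

    // Looks up the threshold for one table, falling back to the default.
    static int analyzeThreshold(Properties props, String tableName) {
        return getIntProperty(props, PREFIX + tableName, DEFAULT_THRESHOLD);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // A very large value effectively disables on-the-fly ANALYZE for jobqueue.
        props.setProperty(PREFIX + "jobqueue", "1000000000");

        System.out.println(analyzeThreshold(props, "jobqueue"));   // overridden value
        System.out.println(analyzeThreshold(props, "othertable")); // default value
    }
}
```

Only the table with an explicit override is affected; every other table
keeps the default of 10000 in this sketch.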

The table in question is "jobqueue".  If you set this value to
something like 1000000000 and you still see MySQL timeouts, then this
new code is not the problem.  And, like I said, the best solution is
to recognize the error and retry, but first I would need the error
code.  Adding appropriate logging of sqlState around line 123 of
framework/core/src/main/java/org/apache/manifoldcf/core/database/DBInterfaceMySQL.java
would show us which code to catch the next time it happens.
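
As a rough sketch of the "log the code, then recognize and retry" idea,
the following standalone Java shows one possible shape.  It is not the
ManifoldCF code: the class, the SqlWork interface, and runWithRetry are
hypothetical names.  It also assumes the driver reports MySQL's
documented server error number 1205 (ER_LOCK_WAIT_TIMEOUT) through
SQLException.getErrorCode(); the logged output is what would confirm or
refute that assumption.

```java
import java.sql.SQLException;

class LockTimeoutRetrySketch {
    // Assumption: 1205 is MySQL's documented ER_LOCK_WAIT_TIMEOUT number.
    // Confirm against the values actually logged before relying on it.
    static final int ASSUMED_LOCK_WAIT_TIMEOUT = 1205;

    // Hypothetical unit of database work that may fail with SQLException.
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    // Formats the fields needed to decide which error to catch.
    static String describe(SQLException e) {
        return "sqlState=" + e.getSQLState()
            + " errorCode=" + e.getErrorCode()
            + " message=" + e.getMessage();
    }

    // Treats only the assumed lock-wait-timeout code as retryable.
    static boolean isRetryable(SQLException e) {
        return e.getErrorCode() == ASSUMED_LOCK_WAIT_TIMEOUT;
    }

    // Runs the work up to maxAttempts times, logging every failure and
    // rethrowing immediately when the error is not considered retryable.
    static <T> T runWithRetry(SqlWork<T> work, int maxAttempts) throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                System.err.println("Attempt " + attempt + " failed: " + describe(e));
                if (!isRetryable(e)) {
                    throw e;
                }
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws SQLException {
        final int[] calls = {0};
        // Simulated work: fails once with the assumed code, then succeeds.
        String result = runWithRetry(() -> {
            if (calls[0]++ == 0) {
                throw new SQLException(
                    "Lock wait timeout exceeded; try restarting transaction",
                    "simulated-state", ASSUMED_LOCK_WAIT_TIMEOUT);
            }
            return "ok";
        }, 3);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

The error message itself says "try restarting transaction", so where the
retry should wrap (a single statement or the whole transaction) is a
design decision this sketch does not settle.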

For the Web connector, the only modifications have been to how it
handles 500 errors; the code now handles them correctly and avoids an
IndexOutOfBoundsException.  This has nothing to do with socket
exceptions, which have external causes only.

Karl


On Wed, Oct 10, 2012 at 10:32 PM, Shigeki Kobayashi
<shigeki.kobayashi3@g.softbank.co.jp> wrote:
> Hi Karl,
>
>
> I was comparing version 1.0 with an old trunk build, based on version 0.6,
> that implements CONNECTORS-501 (Medium-scale web crawl with hopcount-based
> filtering fails to find correct number of documents).
>
> Running each version with the same MySQL settings and the same throttling,
> version 1.0 somehow hangs with the error.
> Since the old trunk completes the crawl, I wonder if something has changed.
>
> Just to make sure, I will recheck whether any MCF settings are wrong.
>
> Thanks.
>
> Regards,
>
> Shigeki
>
> 2012/10/10 Karl Wright <daddywri@gmail.com>
>>
>> Hi Shigeki,
>>
>> The socket timeout exception is only a warning.  It means that some
>> site you are crawling did not accept a socket connection within the
>> allowed time (5 minutes I think).  The Web Connector will retry the
>> connection a few times, and if it is still rejected, it will
>> eventually give up on that page.  One thing you want to check, though,
>> is that you are using proper throttling, because if you aren't then
>> one cause of this problem is that the webmaster of the site you are
>> trying to crawl may have blocked you from accessing it.
>>
>> The database exception is more problematic.  It means that MySQL
>> thinks it took too long for a specific transaction to complete, and
>> the database aborted the transaction due to a timeout.  There are two
>> ways of dealing with this issue.  One way is to modify your MySQL
>> configuration to increase the transaction timeout value to some high
>> number.  The second way is to modify ManifoldCF to recognize the
>> timeout error specifically, and cause a retry.  But in order to do the
>> latter, I would need to know what SQL error code MySQL returns for
>> this situation, which will mean we either need to look it up (if we
>> can), or modify a ManifoldCF instance to log it when this problem
>> occurs.
>>
>> Please let me know how you would like to proceed.
>>
>> Karl
>>
>> On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi
>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >
>> > Hi
>> >
>> > I am having trouble crawling the web using MCF 1.0.
>> > I run MCF with MySQL 5.5 and Tomcat 6.0.
>> > It should keep crawling content, but MCF prints the following database
>> > exception log, then hangs.
>> > After the database exception, a socket timeout exception occurs.
>> >
>> > Has anyone faced this problem?
>> >
>> > --Database Exception log:
>> >
>> > ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread
>> > aborting
>> > and restarting due to database connection reset: Database exception:
>> > Exception doing query: Lock wait timeout exceeded; try restarting
>> > transaction
>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
>> > exception: Exception doing query: Lock wait timeout exceeded; try
>> > restarting
>> > transaction
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
>> >         at
>> >
>> > org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:6049)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessAcivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:6159)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:52)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:225)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:7047)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:6011)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1282)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try
>> > restarting
>> > transaction
>> >         at
>> > com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
>> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
>> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
>> >         at
>> > com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
>> >         at
>> >
>> > com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
>> >         at
>> >
>> > com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
>> >         at
>> > org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
>> > ERROR 2012-10-10 16:11:06,799 (Worker thread '9') - Worker thread
>> > aborting
>> > and restarting due to database connection reset: Database exception:
>> > Exception doing query: Lock wait timeout exceeded; try restarting
>> > transaction
>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database
>> > exception: Exception doing query: Lock wait timeout exceeded; try
>> > restarting
>> > transaction
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
>> >         at
>> >
>> > org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554)
>> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try
>> > restarting
>> > transaction
>> >         at
>> > com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
>> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
>> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
>> >         at
>> > com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
>> >         at
>> >
>> > com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
>> >         at
>> >
>> > com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
>> >         at
>> > org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
>> >         at
>> >
>> > org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
>> >
>> >
>> >
>> > ---- Socket Timeout:
>> >
>> >
>> > DEBUG 2012-10-10 16:16:27,256 (Worker thread '49') - Socket timeout
>> > exception trying to close connection: Read timed out
>> > java.net.SocketTimeoutException: Read timed out
>> >         at java.net.SocketInputStream.socketRead0(Native Method)
>> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>> >         at
>> > java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>> >         at
>> > java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>> >         at
>> > java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >         at
>> > org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown
>> > Source)
>> >         at
>> > org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown
>> > Source)
>> >         at
>> >
>> > org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown
>> > Source)
>> >         at
>> > org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown
>> > Source)
>> >         at java.io.FilterInputStream.close(FilterInputStream.java:155)
>> >         at
>> > org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown
>> > Source)
>> >         at
>> > org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown
>> > Source)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.close(ThrottledFetcher.java:2082)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:176)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
>> >  INFO 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: FETCH
>> >
>> > URL|http://xxxxxx/...|1349852786744+600514|-104|4125|org.apache.manifoldcf.core.interfaces.ManifoldCFException|
>> > Interrupted: Socket timeout: Read timed out
>> > DEBUG 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: Fetch
>> > exception
>> > for 'http://xxxxxx/...'
>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted:
>> > Socket timeout: Read timed out
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1818)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:797)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
>> > Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption:
>> > Socket timeout: Read timed out
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:101)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
>> >         ... 1 more
>> > Caused by: java.net.SocketTimeoutException: Read timed out
>> >         at java.net.SocketInputStream.socketRead0(Native Method)
>> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>> >         at
>> > java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> >         at
>> > java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >         at
>> > org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown
>> > Source)
>> >         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>> >         at
>> > org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown
>> > Source)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2012)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1976)
>> >         at
>> >
>> > org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
>> >         ... 2 more
>> >  WARN 2012-10-10 16:16:27,274 (Worker thread '49') - Pre-ingest service
>> > interruption reported for job 1349774325961 connection 'WEB': Socket
>> > timeout: Read timed out
>> >
>> >
>> >
>> > Regards,
>> >
>> > Shigeki
>
>
>
>
