manifoldcf-user mailing list archives

From Shigeki Kobayashi <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: Web crawling causes Socket Timeout after Database Exception
Date Fri, 19 Oct 2012 09:01:13 GMT
Because of the error, I had to downgrade to a lower version, so I haven't
found the MySQL error code yet.

I installed MCF 1.0 in a different environment, where the crawlable content
differs from that of the environment above.
There I could not reproduce the database exception, but the socket timeout
still occurred.
In that same environment, I ran MCF 0.6 and it completed crawling without a
socket timeout.
As you said, the socket timeout seems to be a different problem from the
database exception.

2012/10/18 Karl Wright <daddywri@gmail.com>

> So, what was the resolution of this problem?  Any news?
> Karl
>
> On Thu, Oct 11, 2012 at 2:28 AM, Karl Wright <daddywri@gmail.com> wrote:
> > The only change is that the MySQL driver now performs ANALYZE
> > operations on the fly in order to keep the database operating at high
> > efficiency.  This is CONNECTORS-510.  It is possible that, on a large
> > database table, these operations will cause others to wait long enough
> > so that their timeout is exceeded.  Such an event does not take place
> > while the load tests run, however.  If you want to turn off the
> > analyze operation, you can do that by setting a per-table property to
> > override the analyze default of 10000 operations:
> >
> > analyzeThreshold = ManifoldCF.getIntProperty("org.apache.manifold.db.mysql.analyze."+tableName,10000);
> >
> > The table in question is "jobqueue".  If you set this value to
> > something like 1000000000 and you still see MySQL timeouts, then this
> > new code is not the problem.  And, like I said, the best solution is
> > to recognize the error and retry, but first I would need the error
> > code.  Adding an appropriate output of sqlState around line 123 of
> > framework/core/src/main/java/org/apache/manifoldcf/core/database/DBInterfaceMySQL.java
> > would allow us to see what code to catch, when it happened again.
> >
> > For the Web connector, the only modifications have been in regard to
> > how it handles 500 errors, which is now coded correctly to avoid an
> > IndexOutOfBoundsException.  This has nothing to do with
> > socket exceptions, which have external causes only.
> >
> > Karl
> >
> >
> > On Wed, Oct 10, 2012 at 10:32 PM, Shigeki Kobayashi
> > <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> >> Hi Karl,
> >>
> >>
> >> I was comparing version 1.0 with an old trunk based on version 0.6
> >> that implements CONNECTORS-501 (Medium-scale web crawl with
> >> hopcount-based filtering fails to find correct number of documents).
> >>
> >> Running each version with the same MySQL settings and the same
> >> throttling, version 1.0 somehow hangs with the error.
> >> Since the old trunk completes crawling, I wonder if something has
> >> changed.
> >>
> >> Just to make sure I will recheck if there are any wrong settings in MCF.
> >>
> >> Thanks.
> >>
> >> Regards,
> >>
> >> Shigeki
> >>
> >> 2012/10/10 Karl Wright <daddywri@gmail.com>
> >>>
> >>> Hi Shigeki,
> >>>
> >>> The socket timeout exception is only a warning.  It means that some
> >>> site you are crawling did not accept a socket connection within the
> >>> allowed time (5 minutes I think).  The Web Connector will retry the
> >>> connection a few times, and if it is still rejected, it will
> >>> eventually give up on that page.  One thing you want to check, though,
> >>> is that you are using proper throttling, because if you aren't then
> >>> one cause of this problem is that the webmaster of the site you are
> >>> trying to crawl may have blocked you from accessing it.
> >>>
> >>> The database exception is more problematic.  It means that MySQL
> >>> thinks it took too long for a specific transaction to complete, and
> >>> the database aborted the transaction due to a timeout.  There are two
> >>> ways of dealing with this issue.  One way is to modify your MySQL
> >>> configuration to increase the transaction timeout value to some high
> >>> number.  The second way is to modify ManifoldCF to recognize the
> >>> timeout error specifically, and cause a retry.  But in order to do the
> >>> latter, I would need to know what SQL error code MySQL returns for
> >>> this situation, which will mean we either need to look it up (if we
> >>> can), or modify a ManifoldCF instance to log it when this problem
> >>> occurs.
> >>>
> >>> Please let me know how you would like to proceed.
> >>>
> >>> Karl
> >>>
> >>> On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi
> >>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> >>> >
> >>> > Hi
> >>> >
> >>> > I am having trouble crawling the web using MCF 1.0.
> >>> > I run MCF with MySQL 5.5 and Tomcat 6.0.
> >>> > It should keep crawling content, but MCF prints the following
> >>> > Database exception log, then hangs.
> >>> > After the DB exception, a socket timeout exception occurs.
> >>> >
> >>> > Has anyone faced this problem?
> >>> >
> >>> > --Database Exception log:
> >>> >
> >>> > ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
> >>> >         at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
> >>> >         at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
> >>> >         at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
> >>> >         at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
> >>> >         at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
> >>> >         at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
> >>> >         at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:6049)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessAcivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:6159)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:52)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:225)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:7047)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:6011)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1282)
> >>> >         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
> >>> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
> >>> >         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
> >>> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
> >>> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
> >>> >         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
> >>> >         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
> >>> >         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
> >>> >         at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
> >>> >         at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
> >>> > ERROR 2012-10-10 16:11:06,799 (Worker thread '9') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
> >>> >         at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
> >>> >         at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
> >>> >         at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
> >>> >         at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
> >>> >         at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
> >>> >         at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
> >>> >         at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863)
> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554)
> >>> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
> >>> >         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
> >>> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
> >>> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
> >>> >         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
> >>> >         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
> >>> >         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
> >>> >         at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
> >>> >         at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
> >>> >
> >>> >
> >>> >
> >>> > ---- Socket Timeout:
> >>> >
> >>> >
> >>> > DEBUG 2012-10-10 16:16:27,256 (Worker thread '49') - Socket timeout exception trying to close connection: Read timed out
> >>> > java.net.SocketTimeoutException: Read timed out
> >>> >         at java.net.SocketInputStream.socketRead0(Native Method)
> >>> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
> >>> >         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >>> >         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> >>> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> >>> >         at org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown Source)
> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown Source)
> >>> >         at java.io.FilterInputStream.close(FilterInputStream.java:155)
> >>> >         at org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown Source)
> >>> >         at org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown Source)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.close(ThrottledFetcher.java:2082)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:176)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
> >>> >  INFO 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: FETCH URL|http://xxxxxx/...|1349852786744+600514|-104|4125|org.apache.manifoldcf.core.interfaces.ManifoldCFException| Interrupted: Socket timeout: Read timed out
> >>> > DEBUG 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: Fetch exception for 'http://xxxxxx/...'
> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted: Socket timeout: Read timed out
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1818)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:797)
> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
> >>> > Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption: Socket timeout: Read timed out
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:101)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
> >>> >         ... 1 more
> >>> > Caused by: java.net.SocketTimeoutException: Read timed out
> >>> >         at java.net.SocketInputStream.socketRead0(Native Method)
> >>> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
> >>> >         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >>> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> >>> >         at java.io.FilterInputStream.read(FilterInputStream.java:116)
> >>> >         at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2012)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1976)
> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
> >>> >         ... 2 more
> >>> >  WARN 2012-10-10 16:16:27,274 (Worker thread '49') - Pre-ingest service interruption reported for job 1349774325961 connection 'WEB': Socket timeout: Read timed out
> >>> >
> >>> >
> >>> >
> >>> > Regards,
> >>> >
> >>> > Shigeki
> >>
> >>
> >>
> >>
>
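
The "recognize the error and retry" approach discussed in the quoted thread could be sketched roughly as follows. This is only an illustration, not ManifoldCF code: the class LockTimeoutRetry and its methods are hypothetical names. It assumes the driver reports MySQL's documented error number 1205 (ER_LOCK_WAIT_TIMEOUT) for "Lock wait timeout exceeded"; the thread never confirms what code was actually returned in this environment, so the sketch also logs sqlState and the error code for every failure.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of ManifoldCF.
final class LockTimeoutRetry {
    // MySQL server error number documented for ER_LOCK_WAIT_TIMEOUT.
    // Assumption: the JDBC driver passes it through getErrorCode().
    static final int ER_LOCK_WAIT_TIMEOUT = 1205;

    private LockTimeoutRetry() {}

    // True when the exception carries the lock-wait-timeout error number.
    static boolean isLockWaitTimeout(SQLException e) {
        return e.getErrorCode() == ER_LOCK_WAIT_TIMEOUT;
    }

    // Formats the fields needed to identify which code to catch.
    static String describe(SQLException e) {
        return "sqlState=" + e.getSQLState()
            + " errorCode=" + e.getErrorCode()
            + " message=" + e.getMessage();
    }

    // Runs the work, retrying only on lock wait timeouts.
    // The work should restart its whole transaction on each call.
    static <T> T withRetry(Callable<T> work, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.call();
            } catch (SQLException e) {
                // Log the codes so the failure can be identified later.
                System.err.println("Database exception: " + describe(e));
                if (!isLockWaitTimeout(e)) {
                    throw e; // any other SQL error is not retried
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt); // linear backoff
                }
            }
        }
        throw last; // every attempt hit the lock wait timeout
    }
}
```

A caller would wrap the transactional work, for example LockTimeoutRetry.withRetry(() -> runTransaction(), 3, 500L), where runTransaction is a placeholder for code that begins, executes, and commits the transaction. The attempt count and backoff are arbitrary illustrative values.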
