manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawling causes Socket Timeout after Database Exception
Date Fri, 19 Oct 2012 10:22:04 GMT
I just looked in the code with svn for differences in the web
connector from release 0.6.  There is a change to the html parser to
allow for handling default values for <option> tags, and a change that
fixes an IndexOutOfBounds exception.  Neither of these can possibly
affect socket timeouts.

I also looked at the solr connector (presuming that is what you are
using as an output connector).  No changes at all since 0.6.

So honestly, I can see no significant changes whatsoever in how a web
crawl indexing into Solr would behave.  If you are seeing differences,
therefore, I simply cannot account for them.

Karl


On Fri, Oct 19, 2012 at 5:01 AM, Shigeki Kobayashi
<shigeki.kobayashi3@g.softbank.co.jp> wrote:
> Due to the error, I had to downgrade to a lower version so I haven't found
> the MySQL error code yet.
>
> I installed MCF1.0 in a different environment where the crawlable contents
> are different from the above environment.
> I could not reproduce the Database exception, but the socket timeout
> occurred.  In the same environment, I ran MCF0.6 and it completed crawling
> without a socket timeout.
> Like you said, the socket timeout seems to be a different problem from the
> Database exception.
>
> 2012/10/18 Karl Wright <daddywri@gmail.com>
>>
>> So, what was the resolution of this problem?  Any news?
>> Karl
>>
>> On Thu, Oct 11, 2012 at 2:28 AM, Karl Wright <daddywri@gmail.com> wrote:
>> > The only change is that the MySQL driver now performs ANALYZE
>> > operations on the fly in order to keep the database operating at high
>> > efficiency.  This is CONNECTORS-510.  It is possible that, on a large
>> > database table, these operations will cause others to wait long enough
>> > so that their timeout is exceeded.  Such an event does not take place
>> > while the load tests run, however.  If you want to turn off the
>> > analyze operation, you can do that by setting a per-table property to
>> > override the analyze default of 10000 operations:
>> >
>> > analyzeThreshold = ManifoldCF.getIntProperty("org.apache.manifold.db.mysql.analyze."+tableName,10000);
>> >
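A minimal sketch of the per-table lookup quoted above, not ManifoldCF's actual implementation. It assumes `java.util.Properties` as a stand-in for `ManifoldCF.getIntProperty`; the class and method names are hypothetical, the prefix string is copied verbatim from the message, and `jobqueue` is the table named in this thread.

```java
import java.util.Properties;

// Hypothetical stand-in for the per-table lookup; it reads from
// java.util.Properties rather than ManifoldCF's own configuration.
class AnalyzeThresholdSketch {
    // Prefix copied verbatim from the message in this thread.
    static final String PREFIX = "org.apache.manifold.db.mysql.analyze.";
    // Default of 10000 operations, as stated in the message.
    static final int DEFAULT_THRESHOLD = 10000;

    // Returns the configured threshold for a table, or the default
    // when the property is unset or not a valid integer.
    static int analyzeThreshold(Properties props, String tableName) {
        String raw = props.getProperty(PREFIX + tableName);
        if (raw == null) {
            return DEFAULT_THRESHOLD;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_THRESHOLD;
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // A very large threshold effectively turns the on-the-fly ANALYZE off.
        props.setProperty(PREFIX + "jobqueue", "1000000000");
        System.out.println(analyzeThreshold(props, "jobqueue"));
        // "othertable" is a made-up name with no override, so the default applies.
        System.out.println(analyzeThreshold(props, "othertable"));
    }
}
```

Only the table with an explicit override changes; every other table keeps the default.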
>> > The table in question is "jobqueue".  If you set this value to
>> > something like 1000000000 and you still see MySQL timeouts, then this
>> > new code is not the problem.  And, like I said, the best solution is
>> > to recognize the error and retry, but first I would need the error
>> > code.  Adding an appropriate output of sqlState around line 123 of
>> >
>> > framework/core/src/main/java/org/apache/manifoldcf/core/database/DBInterfaceMySQL.java
>> > would allow us to see what code to catch, when it happened again.
>> >
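A rough sketch of the kind of diagnostic output suggested above: walk the cause chain of a caught exception and print the SQLState and vendor error code of the first `SQLException` found. The class and method names are hypothetical and this is not the code in `DBInterfaceMySQL.java`. The simulated exception uses error code 1205 and SQLState `HY000`, which is how MySQL documents "Lock wait timeout exceeded"; the thread itself never confirms those values, so treat them as an assumption to verify against a real log.

```java
import java.sql.SQLException;

// Hypothetical helper; not part of ManifoldCF.
class SqlStateLogging {
    // Walks the cause chain and returns the first SQLException, or null if none.
    static SQLException findSqlException(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof SQLException) {
                return (SQLException) cur;
            }
        }
        return null;
    }

    // Builds a one-line description suitable for a log statement.
    static String describe(Throwable t) {
        SQLException e = findSqlException(t);
        if (e == null) {
            return "no SQLException in cause chain";
        }
        return "sqlState=" + e.getSQLState()
            + " errorCode=" + e.getErrorCode()
            + " message=" + e.getMessage();
    }

    public static void main(String[] args) {
        // Simulated failure; the code and state are assumptions (see above).
        SQLException simulated = new SQLException(
            "Lock wait timeout exceeded; try restarting transaction", "HY000", 1205);
        Exception wrapped = new RuntimeException("Database exception", simulated);
        System.out.println(describe(wrapped));
    }
}
```

Logging both values helps because a generic SQLState alone may not distinguish this timeout from other errors, while the vendor error code usually does.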
>> > For the Web connector, the only modifications have been in regards to
>> > how it handles 500 errors, which it now does correctly, avoiding an
>> > IndexOutOfBounds exception.  This has nothing to do with
>> > socket exceptions, which have external causes only.
>> >
>> > Karl
>> >
>> >
>> > On Wed, Oct 10, 2012 at 10:32 PM, Shigeki Kobayashi
>> > <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >> Hi Karl,
>> >>
>> >>
>> >> I was comparing version 1.0 with an old trunk based on version 0.6
>> >> that implements CONNECTORS-501 (Medium-scale web crawl with
>> >> hopcount-based filtering fails to find correct number of documents).
>> >>
>> >> Running each version with the same MySQL settings and the same
>> >> throttling, version 1.0 somehow hangs with the error.
>> >> Since the old trunk completes crawling, I wonder if something has
>> >> changed.
>> >>
>> >> Just to make sure, I will recheck whether there are any wrong settings
>> >> in MCF.
>> >>
>> >> Thanks.
>> >>
>> >> Regards,
>> >>
>> >> Shigeki
>> >>
>> >> 2012/10/10 Karl Wright <daddywri@gmail.com>
>> >>>
>> >>> Hi Shigeki,
>> >>>
>> >>> The socket timeout exception is only a warning.  It means that some
>> >>> site you are crawling did not accept a socket connection within the
>> >>> allowed time (5 minutes I think).  The Web Connector will retry the
>> >>> connection a few times, and if it is still rejected, it will
>> >>> eventually give up on that page.  One thing you want to check, though,
>> >>> is that you are using proper throttling, because if you aren't then
>> >>> one cause of this problem is that the webmaster of the site you are
>> >>> trying to crawl may have blocked you from accessing it.
>> >>>
>> >>> The database exception is more problematic.  It means that MySQL
>> >>> thinks it took too long for a specific transaction to complete, and
>> >>> the database aborted the transaction due to a timeout.  There are two
>> >>> ways of dealing with this issue.  One way is to modify your MySQL
>> >>> configuration to increase the transaction timeout value to some high
>> >>> number.  The second way is to modify ManifoldCF to recognize the
>> >>> timeout error specifically, and cause a retry.  But in order to do the
>> >>> latter, I would need to know what SQL error code MySQL returns for
>> >>> this situation, which will mean we either need to look it up (if we
>> >>> can), or modify a ManifoldCF instance to log it when this problem
>> >>> occurs.
>> >>>
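The second approach above (recognize the timeout and retry) could look roughly like the sketch below. It is an illustration under assumptions, not ManifoldCF's retry logic: the class, method, and parameter names are hypothetical, and the check relies on MySQL's documented vendor code 1205 for "Lock wait timeout exceeded", which this thread has not yet confirmed from a real log.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical retry wrapper; not part of ManifoldCF.
class LockWaitRetry {
    // Assumption: 1205 is MySQL's vendor code for "Lock wait timeout exceeded".
    static final int LOCK_WAIT_TIMEOUT = 1205;

    // Runs the work, retrying only when it fails with the lock wait timeout.
    // Any other SQLException, or a timeout on the last attempt, is rethrown.
    static <T> T withRetry(Callable<T> work, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return work.call();
            } catch (SQLException e) {
                if (e.getErrorCode() != LOCK_WAIT_TIMEOUT || attempt >= maxAttempts) {
                    throw e;
                }
                // Linear backoff before restarting the whole transaction.
                Thread.sleep(backoffMillis * attempt);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated transaction that times out twice and then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SQLException(
                    "Lock wait timeout exceeded; try restarting transaction",
                    "HY000", LOCK_WAIT_TIMEOUT);
            }
            return "committed";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The callable should cover the whole transaction, since MySQL's own message asks for the transaction to be restarted rather than the single statement.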
>> >>> Please let me know how you would like to proceed.
>> >>>
>> >>> Karl
>> >>>
>> >>> On Wed, Oct 10, 2012 at 3:51 AM, Shigeki Kobayashi
>> >>> <shigeki.kobayashi3@g.softbank.co.jp> wrote:
>> >>> >
>> >>> > Hi
>> >>> >
>> >>> > I am having trouble crawling the web using MCF1.0.
>> >>> > I run MCF with MySQL 5.5 and Tomcat 6.0.
>> >>> > It should keep crawling contents, but MCF prints the following
>> >>> > Database exception log, then hangs.
>> >>> > After the DB Exception, a Socket Timeout Exception occurs.
>> >>> >
>> >>> > Has anyone faced this problem?
>> >>> >
>> >>> > --Database Exception log:
>> >>> >
>> >>> > ERROR 2012-10-10 16:11:05,787 (Worker thread '42') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
>> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
>> >>> >         at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
>> >>> >         at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
>> >>> >         at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
>> >>> >         at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.addDocumentReference(WorkerThread.java:1487)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessActivityLinkHandler.noteDiscoveredLink(WebcrawlerConnector.java:6049)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector$ProcessAcivityHTMLHandler.noteAHREF(WebcrawlerConnector.java:6159)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.LinkParseState.noteNonscriptTag(LinkParseState.java:44)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.FormParseState.noteNonscriptTag(FormParseState.java:52)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ScriptParseState.noteTag(ScriptParseState.java:50)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.BasicParseState.dealWithCharacter(BasicParseState.java:225)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleHTML(WebcrawlerConnector.java:7047)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:6011)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:1282)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
>> >>> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
>> >>> >         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
>> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
>> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
>> >>> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
>> >>> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
>> >>> >         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
>> >>> >         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
>> >>> >         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
>> >>> >         at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
>> >>> >         at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
>> >>> > ERROR 2012-10-10 16:11:06,799 (Worker thread '9') - Worker thread aborting and restarting due to database connection reset: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
>> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: Exception doing query: Lock wait timeout exceeded; try restarting transaction
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeViaThread(Database.java:681)
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:709)
>> >>> >         at org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1394)
>> >>> >         at org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:144)
>> >>> >         at org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:186)
>> >>> >         at org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:852)
>> >>> >         at org.apache.manifoldcf.crawler.jobs.JobManager.addDocuments(JobManager.java:4089)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.processDocumentReferences(WorkerThread.java:1932)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.flush(WorkerThread.java:1863)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:554)
>> >>> > Caused by: java.sql.SQLException: Lock wait timeout exceeded; try restarting transaction
>> >>> >         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
>> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
>> >>> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
>> >>> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2002)
>> >>> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2163)
>> >>> >         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2624)
>> >>> >         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2127)
>> >>> >         at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2293)
>> >>> >         at org.apache.manifoldcf.core.database.Database.execute(Database.java:826)
>> >>> >         at org.apache.manifoldcf.core.database.Database$ExecuteQueryThread.run(Database.java:641)
>> >>> >
>> >>> >
>> >>> >
>> >>> > ---- Socket Timeout:
>> >>> >
>> >>> >
>> >>> > DEBUG 2012-10-10 16:16:27,256 (Worker thread '49') - Socket timeout exception trying to close connection: Read timed out
>> >>> > java.net.SocketTimeoutException: Read timed out
>> >>> >         at java.net.SocketInputStream.socketRead0(Native Method)
>> >>> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>> >>> >         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>> >>> >         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>> >>> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >         at org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown Source)
>> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown Source)
>> >>> >         at java.io.FilterInputStream.close(FilterInputStream.java:155)
>> >>> >         at org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown Source)
>> >>> >         at org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown Source)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.close(ThrottledFetcher.java:2082)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:176)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
>> >>> >  INFO 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: FETCH URL|http://xxxxxx/...|1349852786744+600514|-104|4125|org.apache.manifoldcf.core.interfaces.ManifoldCFException| Interrupted: Socket timeout: Read timed out
>> >>> > DEBUG 2012-10-10 16:16:27,273 (Worker thread '49') - WEB: Fetch exception for 'http://xxxxxx/...'
>> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted: Socket timeout: Read timed out
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1818)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:797)
>> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:321)
>> >>> > Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption: Socket timeout: Read timed out
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:101)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:745)
>> >>> >         ... 1 more
>> >>> > Caused by: java.net.SocketTimeoutException: Read timed out
>> >>> >         at java.net.SocketInputStream.socketRead0(Native Method)
>> >>> >         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>> >>> >         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>> >>> >         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>> >>> >         at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>> >>> >         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>> >>> >         at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.basicRead(ThrottledFetcher.java:2012)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledInputstream.read(ThrottledFetcher.java:1976)
>> >>> >         at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:95)
>> >>> >         ... 2 more
>> >>> >  WARN 2012-10-10 16:16:27,274 (Worker thread '49') - Pre-ingest service interruption reported for job 1349774325961 connection 'WEB': Socket timeout: Read timed out
>> >>> >
>> >>> >
>> >>> >
>> >>> > Regards,
>> >>> >
>> >>> > Shigeki
>> >>
>> >>
>> >>
>> >>
>
>
>
>
