manifoldcf-user mailing list archives

From 小林 茂樹 (Information Systems Division / Service Planning Department) <shigeki.kobayas...@g.softbank.co.jp>
Subject Re: [ManifoldCF 0.5] The web crawler remains running after a network connection refused
Date Thu, 10 May 2012 01:54:50 GMT
Karl,

Thanks for the reply.


> For web crawling, no single URL failure will cause the job to abort;

OK, so I understand that if I want it stopped, I need to abort the job manually.


> You can check on the status of an individual URL by using the Document Status report.

The Document Status report says the seed URL is "Waiting for Processing",
which makes sense because the connection is refused. The report does not
show a retry count, though.

The MCF log outputs an exception. Is this also expected behavior?
-----

DEBUG 2012-05-10 10:10:48,215 (Worker thread '34') - WEB: Fetch exception for 'http://xxx.xxx.xxx/index.html'
java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(Unknown Source)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(Unknown Source)
    at org.apache.commons.httpclient.HttpConnection.open(Unknown Source)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(Unknown Source)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(Unknown Source)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
    at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)
 WARN 2012-05-10 10:10:48,216 (Worker thread '34') - Pre-ingest service interruption reported for job 1335340623530 connection 'WEB': Timed out waiting for a connection for 'http://xxx.xxx.xxx/index.html': Connection refused

-----
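
To check my understanding of that WARN line: it looks as if the fetch worker
translates the ConnectException into a "retry later" signal (a service
interruption) rather than a hard failure. Here is a rough, self-contained
Java sketch of that pattern -- my own illustration only; the class and
exception names below are made up and are not ManifoldCF's actual source:

-----
// Illustrative sketch only -- not ManifoldCF code. All names are hypothetical.
import java.io.IOException;
import java.net.ConnectException;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchSketch {

    /** Signals the scheduler to requeue the document instead of failing the job. */
    static class RetryLaterException extends Exception {
        final long retryAtMillis;
        RetryLaterException(String message, long retryAtMillis) {
            super(message);
            this.retryAtMillis = retryAtMillis;
        }
    }

    /** Fetches a URL; a refused connection becomes a retry request, not an error. */
    static int fetch(String url) throws IOException, RetryLaterException {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(10_000);
            return conn.getResponseCode();
        } catch (ConnectException e) {
            // "Connection refused" is treated as temporary: ask to be
            // retried in ~5 minutes rather than aborting the whole job.
            throw new RetryLaterException("Connection refused for " + url,
                    System.currentTimeMillis() + 5L * 60L * 1000L);
        }
    }
}
-----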

Regards,

Shigeki

2012/5/9 Karl Wright <daddywri@gmail.com>

> Hi,
>
> ManifoldCF's web connector is, in general, very cautious about not
> offending the owners of sites.  If it concludes that the site has
> blocked access to a URL, it may remove the URL from its queue for
> politeness, which would prevent further crawling of that URL for the
> duration of the current job.  In most cases, however, if a URL is
> temporarily unavailable, it will be requeued for crawling at a later
> time.  The typical pattern is to attempt to recrawl the URL
> periodically (e.g. every 5 minutes) for many hours before giving up on
> it.  For web crawling, no single URL failure will cause the job to
> abort; it will continue running until all the other URLs have been
> processed or forever (if the job is continuous).
>
> You can check on the status of an individual URL by using the Document
> Status report.  This report should tell you what ManifoldCF intends to
> do with a specific document.  If you locate one such URL and try out
> this report, what does it say?
>
> Karl
>
>
> On Tue, May 8, 2012 at 10:04 PM, 小林 茂樹 (Information Systems Division /
> Service Planning Department) <shigeki.kobayashi3@g.softbank.co.jp> wrote:
> >
> > Hi guys.
> >
> >
> >
> > I need some advice on stopping the MCF web crawler from a running state
> > when a network connection is refused.
> >
> >
> >
> > I use MCF 0.5 with Solr 3.5. I was testing what would happen to the web
> > crawler when I shut down the web site being crawled. I checked the
> > Simple History report and saw "Connection refused" with a status code of
> > "-1", which looked fine. But as I waited, the job status never changed;
> > it remained running. The crawler never crawls in this situation, but
> > when I brought the web site back up, the crawler never started crawling
> > again either.
> >
> > At the very least, I want the crawler to stop running when a network
> > connection is refused, but I don't know how. Does anyone have any ideas?
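
To make sure I understand the retry pattern Karl describes, here is a rough,
self-contained Java sketch -- my own illustration. The 5-minute interval
matches Karl's example, but the 12-hour give-up deadline is an assumption,
and ManifoldCF's real scheduler is of course more sophisticated than a loop:

-----
// Illustrative sketch only -- not ManifoldCF code. Interval matches Karl's
// example; the give-up deadline is an assumed value.
import java.net.InetSocketAddress;
import java.net.Socket;

public class RetrySketch {
    public static void main(String[] args) throws InterruptedException {
        final long retryIntervalMs = 5L * 60L * 1000L;       // every 5 minutes
        final long giveUpAfterMs = 12L * 60L * 60L * 1000L;  // assumed: 12 hours
        final long deadline = System.currentTimeMillis() + giveUpAfterMs;

        while (System.currentTimeMillis() < deadline) {
            try (Socket s = new Socket()) {
                // Probe the host that previously refused the connection.
                s.connect(new InetSocketAddress("xxx.xxx.xxx", 80), 10_000);
                System.out.println("Connection succeeded; URL can be recrawled.");
                return;
            } catch (Exception e) {
                System.out.println("Still unreachable (" + e.getMessage()
                        + "); requeueing for another attempt.");
                Thread.sleep(retryIntervalMs);
            }
        }
        // Only this URL is abandoned; the job keeps running for other URLs.
        System.out.println("Giving up on this URL; the job itself continues.");
    }
}
-----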
