manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawl that doesn't complete and robots.txt error
Date Mon, 10 Feb 2014 18:41:01 GMT
Hi Mark,

The robots parse error is informational only and does not otherwise affect
crawling.  So you will need to look elsewhere for the issue.
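
(To illustrate why it is harmless: a tolerant robots.txt parser logs any
directive it does not recognize, such as the Sitemap: extension, and simply
moves on, so the Allow/Disallow rules it does understand stay in force. Here
is a minimal Java sketch of that log-and-skip behavior -- this is not MCF's
actual parser code, just an illustration:

    import java.util.List;

    // Minimal sketch (not ManifoldCF's real parser): unknown directives
    // are logged and skipped, so parsing never aborts the crawl.
    public class TolerantRobotsParser {
        public static void parse(List<String> lines) {
            for (String raw : lines) {
                String line = raw.trim();
                if (line.isEmpty() || line.startsWith("#"))
                    continue; // blank line or comment
                String lower = line.toLowerCase();
                if (lower.startsWith("user-agent:"))
                    System.out.println("agent rule: " + line);
                else if (lower.startsWith("disallow:") || lower.startsWith("allow:"))
                    System.out.println("path rule:  " + line);
                else
                    // e.g. "Sitemap:" on a parser that predates the extension
                    System.out.println("WARN Unknown robots.txt line: '" + line + "'");
            }
        }

        public static void main(String[] args) {
            parse(List.of(
                "User-agent: *",
                "Disallow: /private/",
                "Sitemap: http://www.somesite.gov/sitemapindex.xml"));
        }
    }

The only consequence of the unknown line is the log entry you saw.)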

First question: what version of MCF are you using?  For a time, trunk (and
the release 1.5 branch) exhibited exactly this problem whenever a connection
that included certificates was used.

I suggest that you rule out blocked sites by looking at the Simple History
report.  If you see a lot of rejections, the sites may be blocking you.
If, on the other hand, not much has happened at all for a while, that's not
the answer.

The fastest way to start diagnosing this problem is to get a thread dump.
I'd be happy to look at it and let you know what I find.
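
(If you have never taken one: assuming a standard JDK is on the path, jstack
will capture a dump from the crawler's JVM, where <pid> is a placeholder for
that process id:

    jstack -l <pid> > thread-dump.txt

On Unix, "kill -3 <pid>" also works; that dump goes to the JVM's stdout.)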

Karl

On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <mlibucha@gmail.com> wrote:

> I kicked off a bunch of web crawls on Friday to run over the weekend. They
> all started fine but didn't finish. No errors in the logs I can find. All
> action seemed to stop after a couple of hours. It's configured as a complete
> crawl that runs every 24 hours.
>
> I don't expect you to know what went wrong from such limited
> information, but I did see a problem with robots.txt (at the bottom of this
> email).
>
> Does it mean robots.txt was not used at all for the crawl, or just that
> one part of it was ignored? (I half expected this kind of error to kill the
> crawl, but maybe I just don't understand it.)
>
> If the crawl were ignoring the robots.txt, or a part of it, and the
> crawled site banned my crawler, what would I see in the MCF logs?
>
> Thanks,
>
> Mark
>
> 02-09-2014 09:54:48.679 robots parse somesite.gov:80 ERRORS 01
> Unknown robots.txt line: 'Sitemap: <http://www.somesite.gov/sitemapindex.xml>'
>
