manifoldcf-user mailing list archives

From Mark Libucha <mlibu...@gmail.com>
Subject Web crawl that doesn't complete and robot.txt error
Date Mon, 10 Feb 2014 18:26:14 GMT
I kicked off a bunch of web crawls on Friday to run over the weekend. They
all started fine but didn't finish. I can't find any errors in the logs. All
activity seemed to stop after a couple of hours. Each job is configured as a
complete crawl that runs every 24 hours.

I don't expect you to have an answer to what went wrong with such limited
information, but I did see a problem with robots.txt (at the bottom of this
email).

Does it mean robots.txt was not used at all for the crawl, or just that the
unrecognized line was ignored? (I half expected this sort of error to kill
the crawl, but maybe I just don't understand it.)

If the crawl were ignoring the robots.txt, or a part of it, and the crawled
site banned my crawler, what would I see in the MCF logs?

Thanks,

Mark

02-09-2014 09:54:48.679  robots parse  somesite.gov:80  ERRORS  01
Unknown robots.txt line: 'Sitemap: <http://www.somesite.gov/sitemapindex.xml>'
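
For comparison only, here is a minimal sketch of how one tolerant robots.txt parser treats a Sitemap line next to ordinary rules. It uses Python's standard-library urllib.robotparser, not ManifoldCF's parser, so it says nothing definite about what MCF did here. The robots.txt content, the "examplebot" user agent, and the page paths are all assumptions made up for illustration; only the Sitemap URL is taken from the log line above. site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. Only the Sitemap URL comes from the log line;
# the User-agent and Disallow rules are invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: http://www.somesite.gov/sitemapindex.xml
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# The Disallow rule is still honored even though a Sitemap line is present.
blocked = rules.can_fetch("examplebot", "http://www.somesite.gov/private/page.html")
allowed = rules.can_fetch("examplebot", "http://www.somesite.gov/index.html")

# This parser records the Sitemap line separately instead of treating it as a rule.
sitemaps = rules.site_maps()

print(blocked, allowed, sitemaps)
```

In this parser the Sitemap line does not affect the allow/disallow rules around it. Whether ManifoldCF behaves the same way after logging "Unknown robots.txt line" is exactly the open question in this message.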
