manifoldcf-user mailing list archives

From Mark Libucha <mlibu...@gmail.com>
Subject Re: Web crawl that doesn't complete and robots.txt error
Date Mon, 10 Feb 2014 19:18:03 GMT
Thanks Karl, we may take you up on the offer when/if we reproduce with just
a single crawl. We were running many at once. Can you describe or point me
at instructions for the thread dump you'd like to see?
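
A thread dump of a Java process like MCF can be taken with jstack <pid>, or by
sending SIGQUIT (kill -3 <pid>), which writes the dump to the process's stdout.
As a minimal sketch of the kind of information such a dump contains, this
standalone class (illustrative only, not part of MCF) prints the stack of every
live thread in its own JVM:

    import java.util.Map;

    // Minimal sketch: print a stack dump of every live thread in this JVM.
    // jstack <pid> produces the same kind of output for an external process.
    public class ThreadDump {
      public static void main(String[] args) {
        for (Map.Entry<Thread, StackTraceElement[]> e
                 : Thread.getAllStackTraces().entrySet()) {
          Thread t = e.getKey();
          System.out.println("\"" + t.getName() + "\" state=" + t.getState());
          for (StackTraceElement frame : e.getValue())
            System.out.println("    at " + frame);
          System.out.println();
        }
      }
    }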

We're using 1.4.1.

The simple history looks clean. All 200s and OKs, with a few broken pipes,
but those documents all seem to have been successfully fetched later. No
rejects.

Thanks again,

Mark



On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Mark,
>
> The robots parse error is informational only and does not otherwise affect
> crawling.  So you will need to look elsewhere for the issue.
>
> First question: what version of MCF are you using?  For a time, trunk (and
> the release 1.5 branch) had exactly this problem whenever connections that
> included certificates were used.
>
> I suggest that you rule out blocked sites by looking at the simple
> history.  If you see a lot of rejections then maybe you are being blocked.
> If, on the other hand, not much has happened at all for a while, that's not
> the answer.
>
> The fastest way to start diagnosing this problem is to get a thread dump.
> I'd be happy to look at it and let you know what I find.
>
> Karl
>
> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>
>> I kicked off a bunch of web crawls on Friday to run over the weekend.
>> They all started fine but didn't finish. No errors in the logs I can find.
>> All action seemed to stop after a couple of hours. It's configured as a
>> complete crawl that runs every 24 hours.
>>
>> I don't expect you to have an answer to what went wrong with such limited
>> information, but I did see a problem with robots.txt (at the bottom of this
>> email).
>>
>> Does it mean robots.txt was not used at all for the crawl, or just that
>> the one unrecognized line was ignored? (I kind of expected this kind of
>> error to kill the crawl, but maybe I just don't understand it.)
>>
>> If the crawl were ignoring the robots.txt, or a part of it, and the
>> crawled site banned my crawler, what would I see in the MCF logs?
>>
>> Thanks,
>>
>> Mark
>>
>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80 ERRORS 01 Unknown
>> robots.txt line: 'Sitemap: <http://www.somesite.gov/sitemapindex.xml>'
>>
>
>
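
As for the error quoted above: the parser did not recognize that line (note the
angle brackets around the URL, which may have been added by mail formatting),
and per Karl's note such lines are logged and skipped rather than aborting the
parse, so the rest of robots.txt still applies. A hypothetical sketch of that
log-and-continue behavior (not MCF's actual parser; the class and method names
are invented for illustration):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Locale;

    // Hypothetical sketch of tolerant robots.txt parsing: unknown directives
    // such as Sitemap: are reported but never abort the parse, so the crawl
    // still honors the User-agent/Disallow/Allow rules it did understand.
    public class TolerantRobotsParser {
      public static void parse(String robotsTxt) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader(robotsTxt));
        String line;
        while ((line = reader.readLine()) != null) {
          line = line.trim();
          if (line.isEmpty() || line.startsWith("#"))
            continue; // skip blank lines and comments
          String lower = line.toLowerCase(Locale.ROOT);
          if (lower.startsWith("user-agent:") || lower.startsWith("disallow:")
              || lower.startsWith("allow:") || lower.startsWith("crawl-delay:")) {
            // record the directive for the crawl (omitted in this sketch)
          } else {
            // Unknown directive: informational only; crawling is unaffected.
            System.err.println("Unknown robots.txt line: '" + line + "'");
          }
        }
      }

      public static void main(String[] args) throws IOException {
        parse("User-agent: *\nDisallow: /private/\n"
            + "Sitemap: http://www.somesite.gov/sitemapindex.xml\n");
      }
    }

Run against that input, the sketch reports only the Sitemap: line and still
honors the Disallow rule.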
