manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawling , robots.txt and access credentials
Date Tue, 16 Sep 2014 17:21:31 GMT
Hi Mario,

I looked at your robots.txt.  In its current form, it should disallow
EVERYTHING from your site.  The reason is that your Disallow path starts
with "/", but the Allow clauses do not, so they never match the requested
URL paths (which always begin with "/").

As for why MCF is letting files through, I suspect this is because MCF
caches robots data.  If you changed the file and expected MCF to pick the
change up immediately, it won't.  The cached copy expires after, I believe,
one hour, and it is kept in the database, so recycling the agents process
won't purge the cache either.

Karl


On Tue, Sep 16, 2014 at 11:44 AM, Karl Wright <daddywri@gmail.com> wrote:

> Authentication never bypasses robots.txt.
>
> You will want to turn on connector debug logging to see the decisions that
> the web connector is making with respect to which documents are fetched or
> not fetched, and why.
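>
> (A sketch of one way to turn that on, assuming the usual properties.xml
> logging switches; the exact property name should be checked against your
> version's documentation:
>
>   <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
>
> Add it to properties.xml, restart the agents process, and the robots and
> fetch decisions should then appear in the ManifoldCF log.)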
>
> Karl
>
>
> On Tue, Sep 16, 2014 at 11:04 AM, Bisonti Mario <Mario.Bisonti@vimar.com>
> wrote:
>
>>
>>
>> Hello.
>>
>>
>>
>> I would like to crawl some documents in a subfolder of a web site:
>>
>> http://aaa.bb.com/
>>
>>
>>
>> Structure is:
>>
>> http://aaa.bb.com/ccc/folder1
>>
>> http://aaa.bb.com/ccc/folder2
>>
>> http://aaa.bb.com/ccc/folder3
>>
>>
>>
>> Folder ccc and its subfolders are protected with Basic authentication:
>> username: joe
>>
>> Password: ppppp
>>
>>
>>
>> I want to permit crawling of only some documents in folder1,
>>
>> so I put a robots.txt at
>>
>> http://aaa.bb.com/ccc/robots.txt
>>
>>
>>
>> The contents of the robots.txt file are:
>>
>> User-agent: *
>>
>> Disallow: /
>>
>> Allow: folder1/doc1.pdf
>>
>> Allow: folder1/doc2.pdf
>>
>> Allow: folder1/doc3.pdf
>>
>>
>>
>>
>>
>> I set up a web repository connection on MCF 1.7 with
>> “Obey robots.txt for all fetches”
>> and, under Access credentials:
>> http://aaa.bb.com/ccc/
>>
>> Basic authentication: joe and ppp
>>
>>
>>
>> When I create a job:
>>
>> Include in crawl: .*
>>
>> Include in index: .*
>>
>> Include only hosts matching seeds? X
>>
>>
>>
>> and I start it, it crawls all the content of folder1,
>> folder2, and folder3,
>>
>> instead of, as I expected, only:
>>
>> http://aaa.bb.com/ccc/folder1/doc1.pdf
>>
>>
>>
>> http://aaa.bb.com/ccc/folder1/doc2.pdf
>>
>>
>>
>> http://aaa.bb.com/ccc/folder1/doc3.pdf
>>
>>
>>
>>
>>
>> Why does this happen?
>>
>>
>>
>> Perhaps the Basic authentication bypasses the “Obey robots.txt
>> for all fetches” setting?
>>
>>
>>
>> Thanks a lot for your help.
>>
>> Mario
>>
>>
>>
>
>
