manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawling , robots.txt and access credentials
Date Tue, 16 Sep 2014 15:44:04 GMT
Authentication never bypasses robots.txt.

You will want to turn on connector debug logging to see the decisions that
the web connector is making with respect to which documents are fetched or
not fetched, and why.

Karl


On Tue, Sep 16, 2014 at 11:04 AM, Bisonti Mario <Mario.Bisonti@vimar.com>
wrote:

>
>
> Hello.
>
>
>
> I would like to crawl some documents in a subfolder of a web site:
>
> http://aaa.bb.com/
>
>
>
> Structure is:
>
> http://aaa.bb.com/ccc/folder1
>
> http://aaa.bb.com/ccc/folder2
>
> http://aaa.bb.com/ccc/folder3
>
>
>
> Folder ccc and its subfolders are protected with Basic authentication:
> Username: joe
>
> Password: ppppp
>
>
>
> I want to permit crawling of only some documents in folder1,
>
> so I put a robots.txt at
>
> http://aaa.bb.com/ccc/robots.txt
>
>
>
> The contents of the robots.txt file are:
>
> User-agent: *
>
> Disallow: /
>
> Allow: folder1/doc1.pdf
>
> Allow: folder1/doc2.pdf
>
> Allow: folder1/doc3.pdf
>
>
>
>
>
> I set up on MCF 1.7 a web repository connection with:
> “Obey robots.txt for all fetches”
> and on Access credentials:
> http://aaa.bb.com/ccc/
>
> Basic authentication: joe and ppp
>
>
>
> When I create a job :
>
> Include in crawl : .*
>
> Include in index: .*
>
> Include only hosts matching seeds? X
>
>
>
> and I start it, it crawls all the content of folder1,
> folder2, and folder3,
>
> instead of, as I expected, only:
>
> http://aaa.bb.com/ccc/folder1/doc1.pdf
>
>
>
> http://aaa.bb.com/ccc/folder1/doc2.pdf
>
>
>
> http://aaa.bb.com/ccc/folder1/doc3.pdf
>
>
>
>
>
> Why does this happen?
>
>
>
> Perhaps Basic authentication bypasses the “Obey robots.txt for
> all fetches” setting?
>
>
>
> Thanks a lot for your help.
>
> Mario
>
>
>
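The behavior Mario describes is consistent with the robots exclusion standard rather than with authentication bypassing robots: crawlers only fetch robots.txt from the host root (http://aaa.bb.com/robots.txt), so a file at /ccc/robots.txt is never consulted, and path rules are matched against the URL path, so they must be absolute (start with "/"). A small sketch with Python's stdlib robot parser illustrates the path issue. The URL and file names are taken from the thread; the Allow lines are listed before Disallow because Python's parser applies the first matching rule (unlike longest-match parsers):

```python
from urllib.robotparser import RobotFileParser

# Rules as in the original post, with relative paths (missing the leading "/").
broken = """\
User-agent: *
Allow: folder1/doc1.pdf
Disallow: /
"""

# Corrected: absolute paths, including the /ccc/ prefix that appears in the URL.
fixed = """\
User-agent: *
Allow: /ccc/folder1/doc1.pdf
Disallow: /
"""

def allowed(rules, url):
    # Parse a robots.txt body directly (no network fetch) and test one URL.
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch("*", url)

url = "http://aaa.bb.com/ccc/folder1/doc1.pdf"
print(allowed(broken, url))  # False - a relative path never matches the URL path
print(allowed(fixed, url))   # True
```

Note that even with corrected rules, the file would still have to live at http://aaa.bb.com/robots.txt to have any effect on a crawler.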
