manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawler does not follow the robots meta tag rules
Date Thu, 27 Jan 2011 15:37:56 GMT
Ordering also matters: the meta tag must precede all links on the page
that you don't want the crawler to follow.  Hope this is OK.

Karl
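
A minimal sketch of the ordering rule described above, written in Python
with the standard-library html.parser module. It is not ManifoldCF code;
the class name OrderSensitiveLinkCollector and the helper scan are
hypothetical, and it assumes that only the "noindex" and "nofollow"
directives matter. Links seen before the robots meta tag are still
collected, which is why the tag has to come first.

```python
from html.parser import HTMLParser


class OrderSensitiveLinkCollector(HTMLParser):
    """Collect <a href> targets until a robots "nofollow" meta tag is seen.

    Hypothetical illustration only; this is not how ManifoldCF is implemented.
    """

    def __init__(self):
        super().__init__()
        self.nofollow = False  # True once a robots meta tag with "nofollow" was parsed
        self.noindex = False   # True once a robots meta tag with "noindex" was parsed
        self.links = []        # hrefs encountered while nofollow was not yet in effect

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            # content is a comma-separated directive list, e.g. "noindex, nofollow"
            content = attrs.get("content") or ""
            directives = {d.strip().lower() for d in content.split(",")}
            self.nofollow = self.nofollow or "nofollow" in directives
            self.noindex = self.noindex or "noindex" in directives
        elif tag == "a" and not self.nofollow:
            href = attrs.get("href")
            if href:
                self.links.append(href)


def scan(html):
    """Parse one page and return the collector holding the decisions made."""
    parser = OrderSensitiveLinkCollector()
    parser.feed(html)
    parser.close()
    return parser


if __name__ == "__main__":
    page = (
        '<a href="/early.html">early</a>'
        '<meta name="robots" content="nofollow">'
        '<a href="/late.html">late</a>'
    )
    print(scan(page).links)  # prints ['/early.html']
```

In this sketch the link to /early.html is followed because it appears
before the tag, while /late.html is skipped.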

On Thu, Jan 27, 2011 at 10:16 AM, Karl Wright <daddywri@gmail.com> wrote:
> Sure, please open a ticket.
> Interpreting the tag should not be difficult.  The main issues will be
> around noting the crawler's decision to skip documents or content in
> the activities history.  And, of course, this will not be available in
> the ManifoldCF-0.1-incubating release.
>
> Please specify what variants of the tag you think should be supported,
> and if supported, how you think it should work.  For example,
> including "nofollow" does not usually block crawlers from reaching
> your linked documents from other directions; if you want that
> functionality, you probably won't find that anywhere.  This is why
> most people use robots.txt rather than the meta tag.
>
> Karl
>
>
> On Thu, Jan 27, 2011 at 10:04 AM, Erlend Garåsen
> <e.f.garasen@usit.uio.no> wrote:
>>
>> I just figured out that the web crawler does not follow the rules defined by
>> the robots meta tag. I created a document with the following tag:
>> <meta name="robots" content="noindex, nofollow">
>>
>> This document also has a link to another document in order to test the
>> "nofollow" rule, but both documents were fetched and indexed by Solr.
>>
>> Should I open a Jira issue about this? I hope it's easy to rewrite the
>> crawler in order to add this functionality since this is a blocker for us.
>>
>> Erlend
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>
>
