manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Web crawler does not follow the robots meta tag rules
Date Thu, 27 Jan 2011 17:43:28 GMT
I've written the necessary code for ManifoldCF, so if you create the
ticket, I can attach a patch.  I don't know if it works yet, but I
presume you will be in a position to try it out?

Karl

On Thu, Jan 27, 2011 at 10:55 AM, Erlend Garåsen
<e.f.garasen@usit.uio.no> wrote:
>
> Thanks for your reply.
>
> OK, now I have two homework assignments:
> - Create a Jira issue about this
> - Explain how it is possible to use ExtractingRequestHandler with Solr 1.4.1
> by copying jars etc.
>
> BTW, I just figured out that Tika parses all the meta tag information, so I
> can rewrite the ExtractingRequestHandler classes in order to skip files with
> these meta directives. The following was included in my index the last time I
> started the ManifoldCF job:
> <arr name="ignored_meta">
> <str>robots</str>
> <str>noindex,nofollow</str>
>
> I have already rewritten some of these classes in order to implement
> language detection, so it seems that we can implement all the functionality
> we need by using ManifoldCF. :)
>
> Erlend
>
> On 27.01.11 16.37, Karl Wright wrote:
>>
>> There's also ordering; the meta tag must precede all links on the page
>> that you don't want the crawler to follow.  Hope this is OK.
>>
>> Karl
>>
>> On Thu, Jan 27, 2011 at 10:16 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>> Sure, please open a ticket.
>>> Interpreting the tag should not be difficult.  The main issues will be
>>> around noting the crawler's decision to skip documents or content in
>>> the activities history.  And, of course, this will not be available in
>>> the ManifoldCF-0.1-incubating release.
>>>
>>> Please specify what variants of the tag you think should be supported,
>>> and if supported, how you think it should work.  For example,
>>> including "nofollow" does not usually block crawlers from reaching
>>> your linked documents from other directions; if you want that
>>> functionality, you probably won't find that anywhere.  This is why
>>> most people use robots.txt rather than the meta tag.
>>>
>>> Karl
>>>
>>>
>>> On Thu, Jan 27, 2011 at 10:04 AM, Erlend Garåsen
>>> <e.f.garasen@usit.uio.no>  wrote:
>>>>
>>>> I just figured out that the web crawler does not follow the rules
>>>> defined by
>>>> the robots meta tag. I created a document with the following tag:
>>>> <meta name="robots" content="noindex, nofollow">
>>>>
>>>> This document also has a link to another document in order to test the
>>>> "nofollow" rule, but both documents were fetched and indexed by Solr.
>>>>
>>>> Should I open a Jira issue about this? I hope it's easy to rewrite the
>>>> crawler in order to add this functionality since this is a blocker for
>>>> us.
>>>>
>>>> Erlend
>>>>
>>>> --
>>>> Erlend Garåsen
>>>> Center for Information Technology Services
>>>> University of Oslo
>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>>> 31050
>>>>
>>>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
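
The behaviour discussed in this thread can be sketched roughly as follows. This is an illustration in Python, not the patch Karl mentions and not ManifoldCF code; the names RobotsMetaParser and evaluate are hypothetical. It assumes only the two directives from the original report ("noindex" and "nofollow") and applies the ordering rule Karl describes: links that appear before the meta tag are still collected, and links after it are skipped.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects links while honouring <meta name="robots"> directives.

    Hypothetical helper for illustration only; not part of ManifoldCF.
    """

    def __init__(self):
        super().__init__()
        self.index_allowed = True    # cleared when "noindex" is seen
        self.follow_allowed = True   # cleared when "nofollow" is seen
        self.links = []              # hrefs the crawler may still follow

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; values may be None.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            directives = {d.strip().lower() for d in content.split(",")}
            if "noindex" in directives:
                self.index_allowed = False
            if "nofollow" in directives:
                self.follow_allowed = False
        elif tag == "a" and self.follow_allowed:
            # Only links seen while following is still allowed are kept,
            # so a meta tag placed late in the page does not retract
            # links that were already collected.
            href = attributes.get("href")
            if href:
                self.links.append(href)


def evaluate(html):
    """Return (index_allowed, links_to_follow) for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.index_allowed, parser.links


# The tag from the original report, placed in the head before any link.
page = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    '<body><a href="other.html">other</a></body></html>'
)

# A page where the meta tag appears after the first link.
late_tag = (
    '<a href="first.html">first</a>'
    '<meta name="robots" content="nofollow">'
    '<a href="second.html">second</a>'
)

if __name__ == "__main__":
    print(evaluate(page))
    print(evaluate(late_tag))
```

For the first page neither the document nor its link would be processed. For the second, only first.html is followed, which shows why the tag needs to precede the links it is meant to cover. As noted in the thread, none of this stops a crawler from reaching the linked document through some other page; robots.txt is the usual tool for that.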
