manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: Web crawler does not follow the robots meta tag rules
Date Thu, 27 Jan 2011 15:55:19 GMT

Thanks for your reply.

OK, so I now have two homework assignments:
- Create a Jira issue about this
- Explain how it is possible to use ExtractingRequestHandler with Solr 
1.4.1 by copying jars etc.

BTW, I just figured out that Tika parses all the meta tag information, 
so I can rewrite the ExtractingRequestHandler classes in order to skip 
files with these meta directives. The following was included in my 
index the last time I started the ManifoldCF job:
<arr name="ignored_meta">
<str>robots</str>
<str>noindex,nofollow</str>

I have already rewritten some of these classes in order to implement 
language detection, so it seems that we can implement all the 
functionality we need by using ManifoldCF. :)
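
As a rough illustration of the check I have in mind, here is a minimal sketch in plain Java. It only interprets the content value of the robots meta tag (the "noindex,nofollow" string shown in the ignored_meta output above). The class name RobotsDirectives and its methods are hypothetical and not part of Solr, Tika or ManifoldCF, and how the value is obtained from the parsed metadata is left out, since that depends on the handler code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper; not part of Solr, Tika, or ManifoldCF. */
final class RobotsDirectives {
    private final Set<String> directives;

    private RobotsDirectives(Set<String> directives) {
        this.directives = directives;
    }

    /** Parses the content value of a robots meta tag, e.g. "noindex, nofollow". */
    static RobotsDirectives parse(String content) {
        if (content == null) {
            // No robots meta tag present: nothing is restricted.
            return new RobotsDirectives(Collections.<String>emptySet());
        }
        // Directives are comma-separated and case-insensitive.
        Set<String> parsed = Arrays.stream(content.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
        return new RobotsDirectives(parsed);
    }

    /** False when the page asks not to be indexed ("noindex" or "none"). */
    boolean shouldIndex() {
        return !(directives.contains("noindex") || directives.contains("none"));
    }

    /** False when the page asks that its links not be followed ("nofollow" or "none"). */
    boolean shouldFollow() {
        return !(directives.contains("nofollow") || directives.contains("none"));
    }

    public static void main(String[] args) {
        RobotsDirectives d = RobotsDirectives.parse("noindex,nofollow");
        System.out.println("index=" + d.shouldIndex() + " follow=" + d.shouldFollow());
    }
}
```

A document would then be skipped on the indexing side whenever shouldIndex() returns false. Note that shouldFollow() can only be honoured by the crawler itself, not by the indexing side.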

Erlend

On 27.01.11 16.37, Karl Wright wrote:
> There's also ordering; the meta tag must precede all links on the page
> that you don't want the crawler to follow.  Hope this is OK.
>
> Karl
>
> On Thu, Jan 27, 2011 at 10:16 AM, Karl Wright<daddywri@gmail.com>  wrote:
>> Sure, please open a ticket.
>> Interpreting the tag should not be difficult.  The main issues will be
>> around noting the crawler's decision to skip documents or content in
>> the activities history.  And, of course, this will not be available in
>> the ManifoldCF-0.1-incubating release.
>>
>> Please specify what variants of the tag you think should be supported,
>> and if supported, how you think it should work.  For example,
>> including "nofollow" does not usually block crawlers from reaching
>> your linked documents from other directions; if you want that
>> functionality, you probably won't find that anywhere.  This is why
>> most people use robots.txt rather than the meta tag.
>>
>> Karl
>>
>>
>> On Thu, Jan 27, 2011 at 10:04 AM, Erlend Garåsen
>> <e.f.garasen@usit.uio.no>  wrote:
>>>
>>> I just figured out that the web crawler does not follow the rules defined by
>>> the robots meta tag. I created a document with the following tag:
>>> <meta name="robots" content="noindex, nofollow">
>>>
>>> This document also has a link to another document in order to test the
>>> "nofollow" rule, but both documents were fetched and indexed by Solr.
>>>
>>> Should I open a Jira issue about this? I hope it's easy to rewrite the
>>> crawler in order to add this functionality since this is a blocker for us.
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>
>>


-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
