Message-ID: <4D419567.5050505@usit.uio.no>
Date: Thu, 27 Jan 2011 16:55:19 +0100
From: Erlend Garåsen
To: 
connectors-user@incubator.apache.org
Subject: Re: Web crawler does not follow the robots meta tag rules
References: <4D418975.5090901@usit.uio.no>

Thanks for your reply.

OK, now I have two homework items:
- Create a Jira issue about this
- Explain how it is possible to use ExtractingRequestHandler with Solr 1.4.1 by copying jars etc.

BTW, I just figured out that Tika parses all the meta tag information, so I can rewrite the ExtractingRequestHandler classes to skip files with these meta directives. The following was included in my index the last time I started the ManifoldCF job:

robots noindex,nofollow

I have already rewritten some of these classes to implement language detection, so it seems that we can implement all the functionality we need by using ManifoldCF. :)

Erlend

On 27.01.11 16.37, Karl Wright wrote:
> There's also ordering; the meta tag must precede all links on the page
> that you don't want the crawler to follow. Hope this is OK.
>
> Karl
>
> On Thu, Jan 27, 2011 at 10:16 AM, Karl Wright wrote:
>> Sure, please open a ticket.
>> Interpreting the tag should not be difficult. The main issues will be
>> around noting the crawler's decision to skip documents or content in
>> the activities history. And, of course, this will not be available in
>> the ManifoldCF-0.1-incubating release.
>>
>> Please specify what variants of the tag you think should be supported,
>> and if supported, how you think it should work. For example,
>> including "nofollow" does not usually block crawlers from reaching
>> your linked documents from other directions; if you want that
>> functionality, you probably won't find that anywhere. This is why
>> most people use robots.txt rather than the meta tag.
>>
>> Karl
>>
>>
>> On Thu, Jan 27, 2011 at 10:04 AM, Erlend Garåsen
>> wrote:
>>>
>>> I just figured out that the web crawler does not follow the rules defined by
>>> the robots meta tag. I created a document with the following tag:
>>>
>>>
>>> This document has also a link to another document in order to test the
>>> "nofollow" rule, but both documents were fetched and indexed by Solr.
>>>
>>> Should I open a Jira issue about this? I hope it's easy to rewrite the
>>> crawler in order to add this functionality since this is a blocker for us.
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>
>>
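
For reference, here is a minimal sketch in Java of how a robots directive value such as the "noindex,nofollow" mentioned in this thread might be interpreted before deciding whether to index a document or follow its links. The class name RobotsDirectives and its fields are hypothetical. This is not ManifoldCF, Solr, or Tika code, and it assumes only that the meta tag's content value is already available as a plain string.

```java
import java.util.Locale;

/** Hypothetical helper: interprets the content value of a robots meta tag. */
final class RobotsDirectives {
    final boolean index;   // false when "noindex" or "none" is present
    final boolean follow;  // false when "nofollow" or "none" is present

    private RobotsDirectives(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    /** Parses a value such as "noindex,nofollow"; null or empty means no restrictions. */
    static RobotsDirectives parse(String content) {
        boolean index = true;
        boolean follow = true;
        if (content != null) {
            for (String token : content.split(",")) {
                switch (token.trim().toLowerCase(Locale.ROOT)) {
                    case "noindex":
                        index = false;
                        break;
                    case "nofollow":
                        follow = false;
                        break;
                    case "none":
                        index = false;
                        follow = false;
                        break;
                    default:
                        // "index", "follow", "all" and unknown tokens change nothing
                        break;
                }
            }
        }
        return new RobotsDirectives(index, follow);
    }
}
```

A caller could check RobotsDirectives.parse(value).index before sending a document to the index, and .follow before queueing the links found on the page. As noted earlier in the thread, "nofollow" on one page does not stop a crawler from reaching the same documents from other directions.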