From: "Fuad Efendi"
Reply-To: nutch-user@lucene.apache.org
Subject: RE: Ignoring Robots.txt
Date: Fri, 11 Sep 2009 13:18:16 -0400

> > My sysadm refuses to change the robots.txt citing the following reason:
> > The moment he allows a specific agent, a lot of crawlers impersonate
> > as that user agent and tries to crawl that site.

That is very strange reasoning from the sysadmin. If a crawler wants to impersonate another agent, it will, and it will ignore robots.txt as well; the sysadmin can then ban its IP. I don't know of any public crawler that does this, apart from some desktop-based download agents such as WebCEO or Teleport, or even IE and Firefox.

No way: Nutch must follow robots.txt.
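
To illustrate the point that robots.txt is only advisory, here is a minimal sketch using Python's standard urllib.robotparser (not Nutch's own code). The agent names "MyNutchBot" and "OtherBot", the rules, and the example.com URL are all made up for the example. The check runs entirely on the crawler's side, against whatever name the crawler chooses to report, so the file itself cannot stop an impersonator; only server-side measures such as an IP ban can.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow one named agent, disallow everyone else.
ROBOTS_TXT = """\
User-agent: MyNutchBot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://example.com/some/page.html"  # placeholder URL

# A polite crawler asks before fetching, using its own agent name.
allowed_named = parser.can_fetch("MyNutchBot", url)

# Any other polite crawler falls under the catch-all rule.
allowed_other = parser.can_fetch("OtherBot", url)

print("MyNutchBot allowed:", allowed_named)
print("OtherBot allowed:", allowed_other)
```

A crawler that lies about its name gets the same answer as the real MyNutchBot, and a crawler that never consults the file is not affected by it at all. That is why allowing one agent in robots.txt does not create a new hole.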