From: Lucas Rockwell
To: nutch-user@incubator.apache.org
Subject: Need help with URL regex
Date: Sun, 8 May 2005 16:54:28 -0700

Hi all,

I have looked in the archives and followed the instructions in the tutorial, but I am still having problems limiting nutch to just my site.
For instance, the tutorial reads:

  2. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME
     with the name of the domain you wish to crawl. For example, if you
     wished to limit the crawl to the nutch.org domain, the line should
     read:

       +^http://([a-z0-9]*\.)*nutch.org/

But when I test the above regex, following a comment in the archives from April 16, using:

  cat file-with-test-urls | nutch net/nutch/net/RegexURLFilter

I get this output:

  +# skip URLs containing certain characters as probable queries, etc.
  --[?*!@=]
  -
  +# limit to org site only
  -+^http://([a-z0-9]*\.)*nutch.org/
  -
  +# do not accept anything else
  ++.

So, according to the filter test, the regex in the tutorial does not work. When I use Doug's example from another email (+^http://www.cs.princeton.edu/(people/(grad|fac)\.php)?$), I also get the "-" sign when I run the test. The "-[?*!@=]" line gets a "-" sign as well...

So, can anyone out there give me the exact syntax so that nutch will *only* crawl the domain (and subdomain(s)) for the site I want to crawl?

Many thanks.

-lucas
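For reference, here is a minimal sketch of how rules of this form could be checked against actual URLs outside of nutch. It is not nutch's implementation. It assumes that rules are tried in order, that the first pattern found anywhere in the URL decides the result, and that a URL matching no rule is rejected. Python's `re` module stands in for whatever regex engine nutch uses. The names `parse_rules` and `check` and the sample URLs are hypothetical; the rule text is copied from the output quoted above.

```python
import re

# Filter rules as they appear in the output quoted above, one per line.
FILTER_TEXT = r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# limit to org site only
+^http://([a-z0-9]*\.)*nutch.org/

# do not accept anything else
+.
"""


def parse_rules(text):
    """Turn filter text into an ordered list of (sign, compiled pattern)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0], re.compile(line[1:])))
    return rules


def check(url, rules):
    """Return '+' or '-' for a URL.

    Assumption: the first rule whose pattern is found in the URL wins,
    and a URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if pattern.search(url):
            return sign
    return "-"


if __name__ == "__main__":
    rules = parse_rules(FILTER_TEXT)
    # Hypothetical test inputs: two URLs, then one of the rule lines itself.
    for candidate in [
        "http://www.nutch.org/docs/",
        "http://www.nutch.org/search?q=x",
        "+^http://([a-z0-9]*\\.)*nutch.org/",
    ]:
        print(check(candidate, rules) + candidate)
```

Each output line is the verdict sign followed by the input, the same layout as the test output quoted above. Under the stated assumptions, a rule line fed in as input is judged as if it were a URL, so the test file should contain one candidate URL per line.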