From: Mohamed Parvez
Date: Thu, 3 Sep 2009 13:26:47 -0500
Subject: URL with Space
To: nutch-user@lucene.apache.org

I am trying to crawl a URL that has a space in it. NUTCH-661 suggests that this can be fixed with a urlnormalizer plugin.

https://issues.apache.org/jira/browse/NUTCH-661

I am using the urlnormalizer plugins (urlnormalizer-(pass|regex|basic)) and I put the rule below in the conf/regex-normalize.xml file:

  <regex>
    <pattern>\s</pattern>
    <substitution>%20</substitution>
  </regex>

But the URL with the space is still not getting crawled.

Any hint as to what needs to be added to the conf/regex-normalize.xml file to make Nutch crawl URLs with spaces?

-------
Thanks/Regards,
Parvez
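
As a sanity check of the rule itself, here is a minimal sketch, independent of Nutch, of what a \s to %20 substitution does to a URL string. It is not Nutch's normalizer code: the class name SpaceRuleSketch, the method normalize, and the sample URL are hypothetical, and only java.util.regex is used.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check of the \s -> %20 rule; not part of Nutch.
public class SpaceRuleSketch {
    // Same pattern as in the rule from the message, compiled once.
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Replace every whitespace character in the URL with the literal "%20".
    static String normalize(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // Sample URL is an assumption made up for illustration.
        String url = "http://example.com/my docs/file name.html";
        System.out.println(normalize(url));
        // prints http://example.com/my%20docs/file%20name.html
    }
}
```

If the substitution behaves this way in isolation, the regex itself is probably not the problem, and the remaining question is whether Nutch actually applies the rule to the URL in question.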