Subject: Re: How can nutch crawl the content of a dynamic url with a query string?
From: kevin chen
To: nutch-user@lucene.apache.org
Date: Sat, 26 Sep 2009 21:36:22 -0400

By default, Nutch skips URLs that contain certain characters, including
the "?" and "=" that appear in query strings. To change this, open
regex-urlfilter.txt and comment out the rule below (the second line) by
putting a "#" in front of it:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

On Sun, 2009-09-27 at 03:55 +0800, Shawn Young wrote:
> Hi all,
>
> I have a question, if a web page's url likes
> http://www.test.com/test.php?gid=1111111 ,how can nutch crawl its content?
> I've had a try, but it seems that nutch ignores the query string
> 'gid=1111111' of the url.
>
> Can someone helps me?
> Thanks.
>
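
To see why the -[?*!@=] rule in regex-urlfilter.txt blocks a URL such as http://www.test.com/test.php?gid=1111111, here is a rough Python sketch that imitates how such a filter file is evaluated. It is not Nutch's code. It assumes the rules are tried from top to bottom, that the first rule whose regex is found anywhere in the URL decides ("+" accepts, "-" rejects), and that a URL matching no rule is rejected. The names load_rules and accepts, and the trailing "+." catch-all rule, are illustrative assumptions and do not come from the original message.

```python
import re

# Assumed minimal filter file: the skip rule from this thread plus a
# catch-all accept rule ("+.") so that everything else gets through.
DEFAULT_RULES = """\
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
+.
"""

# The same file after commenting out the skip rule, as suggested above.
EDITED_RULES = """\
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
+.
"""


def load_rules(text):
    """Parse filter text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first rule found in the URL is an accept rule."""
    for accept, regex in rules:
        if regex.search(url):  # substring search, not a full match
            return accept
    return False  # assumption: URLs matching no rule are rejected


if __name__ == "__main__":
    url = "http://www.test.com/test.php?gid=1111111"
    print("default rules:", accepts(load_rules(DEFAULT_RULES), url))
    print("edited rules: ", accepts(load_rules(EDITED_RULES), url))
```

With the skip rule active, the "?" in the example URL matches the character class and the URL is dropped before it is fetched. With the rule commented out, the URL falls through to the catch-all accept rule.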