From: Shawn Young
Subject: RE: How can nutch crawl the content of a dynamic url with a query string?
Date: Sun, 27 Sep 2009 14:20:24 +0800

Awesome, it works! Thanks a lot, Kevin.

-----Original Message-----
From: kevin chen [mailto:kevinchen@bdsing.com]
Sent: 27 September 2009 9:36
To: nutch-user@lucene.apache.org
Subject: Re: How can nutch crawl the content of a dynamic url with a query string?

By default, nutch skips URLs containing certain characters. To change
that, open regex-urlfilter.txt and comment out the following line.

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

On Sun, 2009-09-27 at 03:55 +0800, Shawn Young wrote:
> Hi all,
>
> I have a question: if a web page's URL looks like
> http://www.test.com/test.php?gid=1111111, how can nutch crawl its content?
> I've had a try, but it seems that nutch ignores the query string
> 'gid=1111111' of the url.
>
> Can someone help me?
> Thanks.
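
For anyone reading this thread in the archive, here is a rough Python sketch of why that one rule blocks the URL in question. It is not Nutch code: the `accepts` helper is hypothetical, the trailing `+.` accept-everything rule is assumed for illustration, and the "first matching rule wins, no match means dropped" behaviour is my understanding of how the regex URL filter reads its rules, so check it against your own regex-urlfilter.txt.

```python
import re

# Two rule sets modelled on regex-urlfilter.txt: '-' rejects, '+' accepts.
# The final "+." (accept anything else) rule is an assumption for this sketch.
DEFAULT_RULES = ["-[?*!@=]", "+."]
EDITED_RULES = ["#-[?*!@=]", "+."]  # the skip rule commented out


def accepts(url, rules):
    """Hypothetical helper: apply the first rule whose pattern is found in url."""
    for rule in rules:
        rule = rule.strip()
        if not rule or rule.startswith("#"):
            continue  # blank lines and comments are ignored
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped


url = "http://www.test.com/test.php?gid=1111111"
print(accepts(url, DEFAULT_RULES))  # False: the '?' and '=' hit the skip rule
print(accepts(url, EDITED_RULES))   # True: only the accept rule is left
```

Commenting the rule out lets every query-string URL through, so on a large site it may be worth adding a narrower reject rule of your own rather than removing the filter entirely.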