From: Mohamed Parvez
Date: Wed, 9 Sep 2009 01:12:35 -0500
Subject: Re: How to crawl pagination in sequence
To: nutch-user@lucene.apache.org

I tried running with one thread; still the same results.

Any hint on how to make Nutch aware of session cookies?

---
Thanks/Regards,
Parvez

On Wed, Sep 9, 2009 at 12:51 AM, wrote:

> How many threads are you running?
>
> Nutch doesn't know about sessions;
>
> you might have to do something like fetching with one thread at a time, but
> that's slow.
>
> Or maybe make Nutch aware of session cookies.
>
> > I am crawling at depth 40, as there are 40 pages in the pagination.
> >
> > It works fine for the first 6 pages, and after that it goes to the 7th
> > page, but it looks like it's a different session and hence the pagination
> > won't work.
> >
> > I mean, if you directly hit page 7 using the URL, the pagination won't
> > work and will return an empty set.
> >
> > But if you go in sequence in the same session, the pagination works.
> >
> > ---
> > Thanks/Regards,
> > Parvez
> >
> > On Wed, Sep 9, 2009 at 12:15 AM, wrote:
> >
> >> It could be tricky from what I've seen;
> >>
> >> there are limits on how many times you can hit one host/IP;
> >>
> >> also, the depth you are crawling at may come into play (which is
> >> probably what you want to look at in this case).
> >>
> >> > Any hint on how to increase the session time of the Nutch crawl thread?
> >> > I tried crawling with one thread, still no luck.
> >> >
> >> > ----
> >> > Thanks/Regards,
> >> > Parvez
> >> >
> >> > On Tue, Sep 8, 2009 at 4:02 PM, Mohamed Parvez wrote:
> >> >
> >> >> I have paginated pages, which will only work if they are crawled in a
> >> >> given sequence, and in the same session.
> >> >>
> >> >> For example, the URLs are:
> >> >>
> >> >> http://www.myhost.com/?page_number=1
> >> >> http://www.myhost.com/?page_number=2
> >> >> http://www.myhost.com/?page_number=3
> >> >>
> >> >> The first page has a link to the second page.
> >> >> The second page has links to the first and second pages.
> >> >> The third page has links to the third and second pages.
> >> >> And so on...
> >> >>
> >> >> Nutch is able to crawl the first 6 pages, but beyond that it is not
> >> >> able to crawl, or it gets an empty result.
> >> >>
> >> >> If I manually click through the pagination in a browser, I can reach
> >> >> the end with no problem.
> >> >>
> >> >> Is the Nutch crawl session timing out? How do we increase it?
> >> >>
> >> >> I tried crawling with one thread, but still the same result.
> >> >>
> >> >> Any suggestions?
> >> >>
> >> >> ---
> >> >> Thanks/Regards,
> >> >> Parvez
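
For reference, the behaviour asked about in this thread (fetching the pages strictly in order while keeping one cookie session) can be sketched outside Nutch. This is only an illustration in plain Python, not Nutch code and not a Nutch configuration. It assumes the server tracks the session with an ordinary HTTP cookie, which the thread suspects but does not confirm. The function names (make_session_opener, page_urls, crawl_in_sequence, opener_fetch) are made up for this sketch; the URL pattern is the one from the original question.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Build an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def page_urls(base, last_page):
    """Yield the pagination URLs in order: ?page_number=1, 2, ..., last_page."""
    for n in range(1, last_page + 1):
        yield f"{base}?page_number={n}"


def crawl_in_sequence(base, last_page, fetch):
    """Fetch every page in order through a single `fetch` callable.

    Stops at the first empty body, since an empty result is the symptom
    the thread describes when the session is lost.
    """
    pages = []
    for url in page_urls(base, last_page):
        body = fetch(url)
        if not body:
            break
        pages.append((url, body))
    return pages


def opener_fetch(opener, timeout=30):
    """Wrap an opener as a fetch(url) -> bytes callable (hypothetical helper)."""
    def fetch(url):
        with opener.open(url, timeout=timeout) as resp:
            return resp.read()
    return fetch
```

To try it against a real site, one would call make_session_opener(), pass opener_fetch(opener) as the fetch argument, and run crawl_in_sequence("http://www.myhost.com/", 40, fetch). Because every request goes through the same opener, any cookie set on page 1 is sent again for pages 2 to 40. The fetch function is passed in so the ordering logic can be checked without touching the network.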