nutch-user mailing list archives

From Mohamed Parvez <par...@gmail.com>
Subject Re: How to crawl pagination in sequence
Date Wed, 09 Sep 2009 05:09:51 GMT
Any hints on how to increase the session time of the Nutch crawl thread?
I tried crawling with a single thread, but still no luck.
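
One way to check, outside Nutch, whether the site really depends on a session cookie is to walk the pages in order while reusing a single cookie jar. Below is a minimal sketch in Python, assuming the page_number URL pattern from the quoted message. The function names and the "stop at the first empty body" rule are assumptions made for illustration; this is not how Nutch itself fetches pages.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Build an opener that keeps cookies across requests (one 'session')."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def http_fetch(opener):
    """Wrap an opener as a fetch(url) -> text callable."""
    def fetch(url):
        with opener.open(url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    return fetch


def crawl_in_sequence(fetch, base_url, max_pages):
    """Fetch pages 1..max_pages in order and stop at the first empty body.

    `fetch` is any callable that takes a URL and returns the body as text.
    Returns a list of (page_number, body_length) for pages with content.
    """
    results = []
    for page in range(1, max_pages + 1):
        # Hypothetical URL pattern, taken from the example URLs in the thread.
        url = f"{base_url}?page_number={page}"
        body = fetch(url)
        if not body.strip():
            break  # an empty page suggests the session or sequence was lost
        results.append((page, len(body)))
    return results
```

Calling crawl_in_sequence(http_fetch(make_session_opener()), "http://www.myhost.com/", 20) would show how far a single-session, strictly ordered walk gets. If it goes past page 6, the cut-off is more likely tied to cookies or request ordering than to a timeout.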

----
Thanks/Regards,
Parvez



On Tue, Sep 8, 2009 at 4:02 PM, Mohamed Parvez <parvez@gmail.com> wrote:

> I have a set of paginated pages that only work if they are crawled in a given
> sequence and in the same session.
>
> For example, the first URLs are:
>
> http://www.myhost.com/?page_number=1
> http://www.myhost.com/?page_number=2
> http://www.myhost.com/?page_number=3
>
> The first page has a link to the second page.
> The second page has links to the first and second pages.
> The third page has links to the third and second pages.
> And so on.
>
> Nutch is able to crawl the first 6 pages, but beyond that it either cannot
> crawl or gets an empty result.
>
> If I manually click through the pagination in a browser, I can reach
> the end with no problem.
>
> Is the Nutch crawl session timing out? How do we increase it?
>
> I tried crawling with one thread, but still got the same result.
>
> Any suggestions?
>
> ---
> Thanks/Regards,
> Parvez
>
>
