nutch-user mailing list archives

From Mohamed Parvez <par...@gmail.com>
Subject How to crawl pagination in sequence
Date Tue, 08 Sep 2009 21:02:54 GMT
I have a set of paginated pages that only work if they are crawled in a
given sequence and within the same session.

For example, the URLs look like this:

http://www.myhost.com/?page_number=1
http://www.myhost.com/?page_number=2
http://www.myhost.com/?page_number=3

The first page has a link to the second page.
The second page has links to the first and second pages.
The third page has links to the third and second pages.
And so on...

Nutch is able to crawl the first 6 pages, but beyond that it either cannot
crawl the pages or gets an empty result.

If I manually click through the pagination in a browser, I can reach the
end with no problem.
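
To check outside Nutch whether the pages depend on a session, the browser
walk-through can be imitated with a small script. This is only a minimal
sketch, assuming the site tracks the session with a cookie (not confirmed);
the function names and the last page number are made up for illustration.
It visits the pages strictly in order through one cookie-aware opener and
records how many bytes each page returns.

```python
import urllib.request
from http.cookiejar import CookieJar

# URL pattern taken from the example URLs in this message.
BASE_URL = "http://www.myhost.com/?page_number={}"


def page_urls(first, last):
    """Return the pagination URLs in the order they must be visited."""
    return [BASE_URL.format(n) for n in range(first, last + 1)]


def fetch_in_sequence(urls, timeout=30):
    """Fetch each URL in order through a single opener.

    The opener keeps one CookieJar, so any cookie set by an earlier page
    is sent with every later request (assumption: the session is
    cookie-based). Returns a dict mapping each URL to its body size.
    """
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    sizes = {}
    for url in urls:
        with opener.open(url, timeout=timeout) as response:
            sizes[url] = len(response.read())
    return sizes


# Example call (performs real HTTP requests; 10 is a hypothetical last page):
#   fetch_in_sequence(page_urls(1, 10))
```

If the pages past the sixth come back non-empty here, the problem is more
likely in how the crawl orders or spaces its requests than in the site.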

Is the Nutch crawl session timing out? If so, how do we increase the timeout?

I tried crawling with one thread, but I still get the same result.

Any suggestions?

---
Thanks/Regards,
Parvez
