nutch-user mailing list archives

From Mohamed Parvez <par...@gmail.com>
Subject Re: How to crawl pagination in sequence
Date Wed, 09 Sep 2009 05:37:55 GMT
I am crawling at depth 40 as there are 40 pages in the pagination.

It works fine for the first 6 pages. After that it does go to the 7th page,
but it looks like that request is in a different session, and hence the
pagination won't work.

I mean that if you hit page 7 directly, using the URL, the pagination won't
work and will return an empty set.

But if you go in the sequence in the same session the pagination works.
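
To make the behaviour described above concrete, here is a minimal standalone sketch (plain Python, not Nutch code) of fetching the pages strictly in order while reusing one session. It assumes the server tracks the session with cookies, which this thread does not confirm; `make_session_fetcher` and `crawl_in_sequence` are hypothetical names used only for illustration.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def make_session_fetcher(timeout=30):
    """Return a fetch(url) function that keeps cookies between requests.

    Assumption: the site identifies the session through cookies.
    """
    opener = build_opener(HTTPCookieProcessor(CookieJar()))

    def fetch(url):
        # Every call goes through the same opener, so cookies set by
        # earlier pages are sent along with later requests.
        with opener.open(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")

    return fetch


def crawl_in_sequence(base_url, last_page, fetch):
    """Fetch page 1..last_page in order and stop at the first empty page."""
    pages = []
    for number in range(1, last_page + 1):
        body = fetch(f"{base_url}?page_number={number}")
        if not body.strip():
            # An empty result suggests the session or the sequence was lost.
            break
        pages.append(body)
    return pages
```

Calling `crawl_in_sequence("http://www.myhost.com/", 40, make_session_fetcher())` would request all 40 pages one after another in the same cookie session. If page 7 still comes back empty with this approach, the cause is probably something other than the session.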


---
Thanks/Regards,
Parvez


On Wed, Sep 9, 2009 at 12:15 AM, <fadzi@butterflycluster.net> wrote:

> It could be tricky, from what I've seen;
>
> there are limits on how many times you can hit one host/IP;
>
> also, the depth you are crawling at may come into play in your case (which
> is probably what you want to look at).
>
>
> > Any hints on how to increase the session time of the Nutch crawl thread?
> > I tried crawling with one thread; still no luck.
> >
> > ----
> > Thanks/Regards,
> > Parvez
> >
> >
> >
> > On Tue, Sep 8, 2009 at 4:02 PM, Mohamed Parvez <parvez@gmail.com> wrote:
> >
> >> I have paginated pages, which will only work if they are crawled in a
> >> given sequence and in the same session.
> >>
> >> For example, the first URLs are:
> >>
> >> http://www.myhost.com/?page_number=1
> >> http://www.myhost.com/?page_number=2
> >> http://www.myhost.com/?page_number=3
> >>
> >> The first page has link to second page.
> >> Second page has link to first and second page.
> >> Third page has link to third and second page.
> >> So On...
> >>
> >> Nutch is able to crawl the first 6 pages, but beyond that it is either
> >> not able to crawl or is getting an empty result.
> >>
> >> If I manually click through the pagination in a browser, I can reach
> >> the end with no problem.
> >>
> >> Is the Nutch crawl session timing out? How do we increase it?
> >>
> >> I tried crawling with one thread, but still got the same result.
> >>
> >> Any suggestions?
> >>
> >> ---
> >> Thanks/Regards,
> >> Parvez
> >>
> >>
> >
>
>
>
