nutch-user mailing list archives

From Mohamed Parvez <par...@gmail.com>
Subject Re: How to crawl pagination in sequence
Date Wed, 09 Sep 2009 06:12:35 GMT
I tried running with one thread, but still got the same results.
Any hints on how to make Nutch aware of session cookies?
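
To make the behaviour I am after concrete, here is a minimal sketch outside Nutch, using only the Python standard library. It is an illustration under assumptions, not a Nutch fix: the function names are made up, the page_number URL pattern is taken from my example further down the thread, and treating an empty response body as "session lost" is a guess at how the site behaves.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_session_fetch():
    """Return a fetch(url) function whose requests all share one cookie jar.

    Hypothetical helper: any session cookie set by the first response is
    sent back automatically on every later request made through it.
    """
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    def fetch(url):
        with opener.open(url, timeout=30) as response:
            return response.read()

    return fetch


def crawl_in_sequence(fetch, base_url, last_page):
    """Fetch pages 1..last_page strictly in order with the given fetch function.

    Stops at the first empty body, which is assumed here to mean the
    server no longer recognises the session.
    """
    pages = {}
    for number in range(1, last_page + 1):
        url = f"{base_url}?page_number={number}"
        body = fetch(url)
        if not body:
            break
        pages[number] = body
    return pages
```

Calling crawl_in_sequence(make_session_fetch(), "http://www.myhost.com/", 40) would walk the 40 pages one after another through a single cookie jar, which is what a browser does when I click through by hand.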

---
Thanks/Regards,
Parvez


On Wed, Sep 9, 2009 at 12:51 AM, <fadzi@butterflycluster.net> wrote:

> how many threads are you running at?
>
> nutch doesn't know about sessions;
>
> you might have to do something like fetching with one thread at a time, but
> that's slow.
>
> or maybe make nutch aware of session cookies.
>
> > I am crawling at depth 40 as there are 40 pages in the pagination.
> >
> > It works fine for the first 6 pages and after that it goes to the 7th
> > page,
> > but it looks like it's a different session and hence the pagination won't work.
> >
> > I mean if you directly hit page 7, using the URL, the pagination won't
> > work and will return an empty set.
> >
> > But if you go in the sequence in the same session the pagination works.
> >
> >
> > ---
> > Thanks/Regards,
> > Parvez
> >
> >
> > On Wed, Sep 9, 2009 at 12:15 AM, <fadzi@butterflycluster.net> wrote:
> >
> >> could be tricky from what i've seen;
> >>
> >> there are limits on how many times you can hit one host/ip;
> >>
> >> also what depth you are crawling at may come into play in your case (which
> >> is probably what you want to look at in this case).
> >>
> >>
> >> > Any hints on how to increase the session time of the Nutch crawl thread?
> >> > I tried crawling with one thread, still no luck.
> >> >
> >> > ----
> >> > Thanks/Regards,
> >> > Parvez
> >> >
> >> >
> >> >
> >> > On Tue, Sep 8, 2009 at 4:02 PM, Mohamed Parvez <parvez@gmail.com>
> >> wrote:
> >> >
> >> >> I have paginated pages, which only work if they are crawled in a
> >> given
> >> >> sequence, and in the same session.
> >> >>
> >> >> For example, the first URLs are:
> >> >>
> >> >> http://www.myhost.com/?page_number=1
> >> >> http://www.myhost.com/?page_number=2
> >> >> http://www.myhost.com/?page_number=3
> >> >>
> >> >> The first page has a link to the second page.
> >> >> The second page has links to the first and second pages.
> >> >> The third page has links to the third and second pages.
> >> >> And so on...
> >> >>
> >> >> Nutch is able to crawl the first 6 pages, but beyond that it is
> >> not
> >> >> able to crawl or is getting an empty result.
> >> >>
> >> >> If I manually click through the pagination, in a browser, I can reach
> >> >> till
> >> >> the end with no problem.
> >> >>
> >> >> Is the Nutch crawl session timing out? How do we increase it?
> >> >>
> >> >> I tried crawling with one thread but still got the same result.
> >> >>
> >> >> Any suggestions?
> >> >>
> >> >> ---
> >> >> Thanks/Regards,
> >> >> Parvez
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >
>
>
>
