nutch-user mailing list archives

From fa...@butterflycluster.net
Subject Re: How to crawl pagination in sequence
Date Wed, 09 Sep 2009 06:16:41 GMT
i don't know; look around the httpclient code.

but you probably want to make sure it's a client session issue first.

i could be wrong.
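
One way to check that outside Nutch: fetch the pages in order twice, once with a client that keeps cookies and once with a client that drops them, and see which pages change. This is only a rough Python sketch using the standard library, not Nutch code. The function names are made up for illustration, the page_number URL pattern is taken from the original question, and comparing body sizes is a crude heuristic that I am assuming is enough to spot an empty result set.

```python
import http.cookiejar
import urllib.request


def make_opener(keep_cookies):
    """Build a urllib opener; keep_cookies=True stores and resends cookies."""
    if keep_cookies:
        jar = http.cookiejar.CookieJar()
        return urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(jar)
        )
    # The default opener has no cookie handler, so every request is "new".
    return urllib.request.build_opener()


def fetch_sizes(opener, base_url, last_page, timeout=10):
    """Fetch page_number=1..last_page in order; return {page: body size}."""
    sizes = {}
    for page in range(1, last_page + 1):
        url = f"{base_url}?page_number={page}"
        with opener.open(url, timeout=timeout) as resp:
            sizes[page] = len(resp.read())
    return sizes


def session_dependent_pages(base_url, last_page, opener_factory=make_opener):
    """List pages whose body size differs when cookies are not kept."""
    with_session = fetch_sizes(opener_factory(True), base_url, last_page)
    without_session = fetch_sizes(opener_factory(False), base_url, last_page)
    return [
        page for page in with_session
        if with_session[page] != without_session[page]
    ]
```

Calling session_dependent_pages("http://www.myhost.com/", 40) would list the pages that come back differently without cookies. If the list is empty, cookies are probably not what breaks the pagination, and the cause is likely elsewhere.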

> I tried running with one thread, still the same results.
> Any hint on how to make Nutch aware of session cookies?
>
> ---
> Thanks/Regards,
> Parvez
>
>
> On Wed, Sep 9, 2009 at 12:51 AM, <fadzi@butterflycluster.net> wrote:
>
>> how many threads are you running at?
>>
>> nutch doesn't know about sessions;
>>
>> you might have to do something like fetching with one thread at a time,
>> but that's slow.
>>
>> or maybe make nutch aware of session cookies.
>>
>> > I am crawling at depth 40 as there are 40 pages in the pagination.
>> >
>> > It works fine for the first 6 pages, and after that it goes to the 7th
>> > page, but it looks like it's a different session and hence the
>> > pagination won't work.
>> >
>> > I mean, if you directly hit page 7 using the URL, the pagination won't
>> > work and will return an empty set.
>> >
>> > But if you go in sequence in the same session, the pagination works.
>> >
>> >
>> > ---
>> > Thanks/Regards,
>> > Parvez
>> >
>> >
>> > On Wed, Sep 9, 2009 at 12:15 AM, <fadzi@butterflycluster.net> wrote:
>> >
>> >> could be tricky from what i've seen;
>> >>
>> >> there are limits on how many times you can hit one host/ip;
>> >>
>> >> also the depth you are crawling at may come into play (which is
>> >> probably what you want to look at in this case).
>> >>
>> >>
>> >> > Any hint on how to increase the session time of the Nutch crawl thread?
>> >> > I tried crawling with one thread, still no luck.
>> >> >
>> >> > ----
>> >> > Thanks/Regards,
>> >> > Parvez
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Sep 8, 2009 at 4:02 PM, Mohamed Parvez <parvez@gmail.com> wrote:
>> >> >
>> >> >> I have paginated pages, which only work if they are crawled in a
>> >> >> given sequence, and in the same session.
>> >> >>
>> >> >> For example, the first URLs are:
>> >> >>
>> >> >> http://www.myhost.com/?page_number=1
>> >> >> http://www.myhost.com/?page_number=2
>> >> >> http://www.myhost.com/?page_number=3
>> >> >>
>> >> >> The first page has a link to the second page.
>> >> >> The second page has links to the first and second pages.
>> >> >> The third page has links to the third and second pages.
>> >> >> And so on...
>> >> >>
>> >> >> Nutch is able to crawl the first 6 pages, but beyond that it is not
>> >> >> able to crawl, or is getting an empty result.
>> >> >>
>> >> >> If I manually click through the pagination in a browser, I can reach
>> >> >> the end with no problem.
>> >> >>
>> >> >> Is the Nutch crawl session timing out? How do we increase it?
>> >> >>
>> >> >> I tried crawling with one thread but still got the same result.
>> >> >>
>> >> >> Any suggestions?
>> >> >>
>> >> >> ---
>> >> >> Thanks/Regards,
>> >> >> Parvez
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>


