nutch-user mailing list archives

From fa...@butterflycluster.net
Subject Re: How to crawl pagination in sequence
Date Wed, 09 Sep 2009 05:15:38 GMT
This could be tricky, from what I've seen.

There are limits on how often you can hit a single host/IP.

The depth you are crawling at may also come into play here, and that is
probably the first thing to look at in your case.
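
To illustrate why depth matters for a chain of paginated links, here is a toy model in Python. It is not Nutch code. It assumes that each crawl round fetches only URLs discovered in earlier rounds, and that page N links only to pages N-1 and N+1. The function name and parameters are made up for this sketch.

```python
def pages_fetched(last_page, rounds):
    """Simulate a round-based crawl of pages 1..last_page, seeded with page 1.

    Assumption: each round fetches only the pages discovered in the
    previous round, and page N links to its neighbours N-1 and N+1.
    """
    known = {1}       # URLs discovered so far
    frontier = [1]    # URLs scheduled for the next round
    fetched = []

    for _ in range(rounds):
        if not frontier:
            break  # nothing left to fetch
        next_frontier = []
        for page in frontier:
            fetched.append(page)
            # Extract outlinks and queue the ones not seen before.
            for target in (page - 1, page + 1):
                if 1 <= target <= last_page and target not in known:
                    known.add(target)
                    next_frontier.append(target)
        frontier = next_frontier

    return fetched


print(pages_fetched(last_page=20, rounds=5))
print(pages_fetched(last_page=20, rounds=20))
```

Under these assumptions the number of pages fetched can never exceed the number of rounds, so a long pagination chain needs a correspondingly large depth. Whether that is what stops your crawl is something to check against your own depth setting.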


> Any hints on how to increase the session time of the Nutch crawl thread?
> I tried crawling with one thread; still no luck.
>
> ----
> Thanks/Regards,
> Parvez
>
>
>
> On Tue, Sep 8, 2009 at 4:02 PM, Mohamed Parvez <parvez@gmail.com> wrote:
>
>> I have paginated pages that only work if they are crawled in a given
>> sequence and in the same session.
>>
>> For example, the first URLs are:
>>
>> http://www.myhost.com/?page_number=1
>> http://www.myhost.com/?page_number=2
>> http://www.myhost.com/?page_number=3
>>
>> The first page has a link to the second page.
>> The second page has links to the first and second pages.
>> The third page has links to the third and second pages.
>> And so on...
>>
>> Nutch is able to crawl the first 6 pages, but beyond that it is either
>> not able to crawl or gets an empty result.
>>
>> If I manually click through the pagination in a browser, I can reach
>> the end with no problem.
>>
>> Is the Nutch crawl session timing out? How do we increase it?
>>
>> I tried crawling with one thread but still got the same result.
>>
>> Any suggestions?
>>
>> ---
>> Thanks/Regards,
>> Parvez
>>
>>
>


