nutch-user mailing list archives

From Matthew Holt <mh...@redhat.com>
Subject Re: Recrawling question
Date Tue, 06 Jun 2006 19:56:30 GMT
The recrawl worked this time, and I recrawled the entire db using the 
-adddays argument (in my case ./recrawl crawl 10 31). However, it didn't 
find a newly created page.
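
The recrawl script itself isn't reproduced in this thread, so for context, here is a minimal sketch of what such a script typically does against a Nutch 0.8-style layout. The crawl/ paths and the loop structure are assumptions for illustration, not necessarily the script used here:

  #!/bin/bash
  # Hypothetical recrawl sketch -- not the actual script from this thread.
  # Usage: ./recrawl <crawl_dir> <depth> <adddays>, e.g. ./recrawl crawl 10 31
  crawl_dir=$1
  depth=$2
  adddays=$3

  for ((i = 0; i < depth; i++)); do
    # -adddays shifts the clock forward, so pages whose fetch interval
    # would otherwise expire within that many days become due for refetch now
    bin/nutch generate $crawl_dir/crawldb $crawl_dir/segments -adddays $adddays
    segment=`ls -d $crawl_dir/segments/* | tail -1`
    bin/nutch fetch $segment
    # updatedb is the step that adds newly discovered links to the crawldb
    bin/nutch updatedb $crawl_dir/crawldb $segment
  done

  # rebuild the link database and the index over all segments
  # ("index" refuses to write into an existing indexes dir; merging
  # with an old index is omitted from this sketch)
  bin/nutch invertlinks $crawl_dir/linkdb $crawl_dir/segments/*
  bin/nutch index $crawl_dir/indexes $crawl_dir/crawldb $crawl_dir/linkdb $crawl_dir/segments/*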

If I delete the database and do the initial crawl over again, the new 
page is found. Any idea what I'm doing wrong or why it isn't finding it?

Thanks!
Matt
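
A likely explanation for the missing page, given how the fetch cycle works: Nutch only discovers a new URL when updatedb processes a segment in which an already-known page linking to it was refetched, and that newly added URL is only fetched in a *later* generate/fetch round. If the linking page wasn't due for refetch, or the remaining rounds ran out, the new page never reaches the index; a fresh crawl finds it because every round starts from scratch against the seed list. Whether the page has at least been discovered can be checked with readdb (paths assume the crawl/ layout sketched above; the URL is a placeholder):

  # summary counts: fetched, unfetched, etc.
  bin/nutch readdb crawl/crawldb -stats
  # status of one specific URL
  bin/nutch readdb crawl/crawldb -url http://www.example.com/new-page.html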

Matthew Holt wrote:

> Stefan,
>  Thanks a bunch! I see what you mean.
> matt
>
> Stefan Neufeind wrote:
>
>> Matthew Holt wrote:
>>  
>>
>>> Hi all,
>>>  I have already successfully indexed all the files on my domain only (as
>>> specified in the conf/crawl-urlfilter.txt file).
>>>
>>> Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>> domain, it begins indexing pages outside my domain (such as Wikipedia,
>>> etc.). How do I prevent this? Thanks!
>>>   
>>
>>
>> Hi Matt,
>>
>> have a look at regex-urlfilter. "crawl" is special in some ways:
>> it's actually a "shortcut" for several steps, and it has its own
>> urlfilter file (crawl-urlfilter.txt). If you run those steps
>> individually, that urlfilter file is no longer used. (See the
>> example filter file after this quoted thread.)
>>
>>
>> Regards,
>> Stefan
>>
>>  
>>
>
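
For completeness, a sketch of the kind of entries conf/regex-urlfilter.txt would need so the step-by-step tools stay on one domain. It uses the same "+"/"-" regex syntax as crawl-urlfilter.txt; example.com stands in for the real domain:

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):
  # skip URLs with characters that usually indicate CGI queries
  -[?*!@=]
  # accept only hosts in example.com
  +^http://([a-z0-9]*\.)*example.com/
  # reject everything else
  -.

The first matching pattern wins, so the catch-all "-." has to stay last.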
