manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: Crawling just one particular page from a host
Date Tue, 14 May 2013 12:06:19 GMT
On 14.05.13 13.49, Karl Wright wrote:
> Hi Erlend,
>
> "Hosts matching seeds" means that if the domain (in this case
> www.ibsen.uio.no) is mentioned in a seed, a
> page with the same domain will be included in the crawl if there is
> nothing else that excludes it.  So it sounds like it is working as designed.

Yes, you are right. I'm trying to find a simple way to crawl only
the starting page of a host and nothing else, i.e.:
http://www.ibsen.uio.no/forside.xhtml

I tried putting this in the "include in crawl" box:
http://www\.ibsen\.uio\.no/forside\.xhtml$

It still includes everything else from that host unless I write a
lot of exclude regexp rules.
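
For what it's worth, the pattern itself can be checked outside ManifoldCF. The Python sketch below is only a sanity check of the regular expression, not of the connector: it assumes the pattern is applied with "find anywhere in the URL" semantics, which may not be how ManifoldCF evaluates it. The helper name is_included and every URL except the start page are made up for illustration.

```python
import re

# The include pattern from the message, unchanged.
INCLUDE_PATTERN = re.compile(r"http://www\.ibsen\.uio\.no/forside\.xhtml$")


def is_included(url: str) -> bool:
    """Return True if the pattern is found anywhere in the URL.

    Assumption: search (find) semantics, not a full-string match.
    """
    return INCLUDE_PATTERN.search(url) is not None


# Only the first URL comes from the message; the rest are hypothetical
# examples of other pages on the same host.
candidates = [
    "http://www.ibsen.uio.no/forside.xhtml",
    "http://www.ibsen.uio.no/",
    "http://www.ibsen.uio.no/other.xhtml",
    "http://www.ibsen.uio.no/forside.xhtml?page=2",
]

for url in candidates:
    print(url, is_included(url))
```

Under that assumption the pattern accepts only the start page, so if other pages from the host are still crawled, the cause is probably something other than the expression itself.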

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
