manifoldcf-dev mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Possible bug in seeds list (web connector)
Date Thu, 15 Mar 2012 17:47:34 GMT

If I add the following URL into my seeds list:
http://www.uio.no
and this into the "include in crawl" list:
http://www.uio.no/.*
the job ends shortly after it starts without fetching anything at all. 
If I add the missing trailing slash to the URL in my seeds list 
(http://www.uio.no/), it works as it should.

I also discovered a similar behaviour. If I add the following to my 
seeds list:
www.uio.no
select the "include only hosts matching seeds?" option, and do not add 
anything to the "include in crawl" list, the same thing happens: no URLs 
are fetched.

I suggest that we do something like this:
- A URL in the Java code will always start with "http(s)://www.myhost.com/"
- If you fail to add the protocol or the trailing slash, it will be 
added automatically instead of returning an error message.

By "in the Java code", I mean that it should automatically be formatted 
like this before we do a regular expression match.
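
A minimal sketch of that normalization step, to make the suggestion 
concrete. The class SeedUrlNormalizer and its methods are hypothetical 
and not part of the ManifoldCF code base, and falling back to "http" when 
no protocol is given is my assumption:

```java
import java.net.URI;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from ManifoldCF.
final class SeedUrlNormalizer {

    // Adds a protocol and a root path to a seed if either one is missing.
    static String normalize(String seed) {
        String url = seed.trim();
        // Assumption: default to plain http when no protocol was typed.
        if (!url.matches("(?i)^https?://.*")) {
            url = "http://" + url;
        }
        // URI.create throws IllegalArgumentException for malformed input.
        URI uri = URI.create(url);
        String authority = uri.getRawAuthority();
        if (authority == null) {
            throw new IllegalArgumentException("No host in seed: " + seed);
        }
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            // Insert "/" right after the host so that an include pattern
            // such as http://www.uio.no/.* can match the seed.
            int end = url.indexOf("://") + 3 + authority.length();
            url = url.substring(0, end) + "/" + url.substring(end);
        }
        return url;
    }

    // Normalizes the seed first, then applies the include pattern.
    static boolean matchesInclude(String seed, String includeRegex) {
        return Pattern.compile(includeRegex).matcher(normalize(seed)).find();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.uio.no")); // http://www.uio.no/
        System.out.println(normalize("www.uio.no"));        // http://www.uio.no/
        System.out.println(matchesInclude("http://www.uio.no",
                                          "http://www.uio.no/.*"));
    }
}
```

With something like this in place, both seeds from my examples above 
would end up as http://www.uio.no/ before the regular expression match.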

Erlend

-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
