manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Possible bug in seeds list (web connector)
Date Thu, 15 Mar 2012 17:55:21 GMT
But this makes sense, actually.  The url "http://www.uio.no" does not
actually match the regexp "http://www.uio.no/.*", so it is ditched.
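
The mismatch is easy to reproduce in plain Java. This is only a sketch: the class and method names are made up for illustration, and it assumes the include pattern is applied to the whole URL with Matcher.matches(), which may not be exactly what the web connector does. A Matcher.find() search would also reject the first URL, since the pattern requires a literal slash after the host.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of ManifoldCF.
class SeedMatchDemo {
    // Assumption: the include regexp has to match the entire URL.
    static boolean isIncluded(String url, String includeRegex) {
        return Pattern.compile(includeRegex).matcher(url).matches();
    }

    public static void main(String[] args) {
        String include = "http://www.uio.no/.*";
        // No trailing slash: the literal "/" in the pattern has nothing to match.
        System.out.println(isIncluded("http://www.uio.no", include));   // false
        // With the trailing slash, ".*" matches the empty remainder.
        System.out.println(isIncluded("http://www.uio.no/", include));  // true
    }
}
```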

The proposal to silently modify the seed according to some criteria
makes me nervous.  I'd much rather the UI caught and complained about
seeds that were non-conforming than have something silent happen under
the covers.
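
A check of that kind could look roughly like the following. This is a sketch only: SeedValidator and findNonConformingSeeds are hypothetical names, not existing ManifoldCF code, and it again assumes whole-URL matching. It reports the seeds that no include regexp accepts and leaves them unmodified, so the UI could show them to the user.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ManifoldCF.
class SeedValidator {
    // Returns the seeds that match none of the include regexps.
    // Assumption: a seed conforms only if some regexp matches the whole URL.
    static List<String> findNonConformingSeeds(List<String> seeds,
                                               List<String> includeRegexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : includeRegexes) {
            patterns.add(Pattern.compile(regex));
        }
        List<String> rejected = new ArrayList<>();
        for (String seed : seeds) {
            boolean accepted = false;
            for (Pattern pattern : patterns) {
                if (pattern.matcher(seed).matches()) {
                    accepted = true;
                    break;
                }
            }
            if (!accepted) {
                rejected.add(seed);  // reported as-is, never rewritten
            }
        }
        return rejected;
    }
}
```

With the seeds "http://www.uio.no" and "http://www.uio.no/" and the single include regexp "http://www.uio.no/.*", only the first seed would be reported.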

Karl


On Thu, Mar 15, 2012 at 1:47 PM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>
> If I add the following URL into my seeds list:
> http://www.uio.no
> and this into the "include in crawl" list:
> http://www.uio.no/.*
> the job will just end shortly after it starts without fetching anything at
> all. If I add the missing trailing slash into my seeds url list
> (http://www.uio.no/), it works as it should.
>
> I also discovered another similar behaviour. If I add the following into my
> seeds list:
> www.uio.no
> select the "include only hosts matching seeds?" option and do not add
> anything into the "include in crawl" list, the same thing happens. No URLs will be
> fetched.
>
> I suggest that we do something like this:
> - A URL in the Java code will always start with "http(s)://www.myhost.com/"
> - If you fail to add the protocol or the trailing slash, it will be added
> automatically instead of returning an error message.
>
> By "in the Java code", I mean that it should automatically be formatted like
> this before we do a regular expression match.
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
