manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Possible bug in seeds list (web connector)
Date Fri, 16 Mar 2012 10:51:26 GMT
"Do you agree that a well-formed URL is what java.net.URL will accept
in the constructor's argument? Then www.example.org will fail, but
http://www.example.org (without a trailing slash) will pass."
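
To make the quoted question concrete, here is a minimal sketch of such a check. The class name SeedUrlCheck and the method isWellFormed are hypothetical and are not part of ManifoldCF. The sketch relies only on the fact that the java.net.URL constructor throws MalformedURLException when the string has no protocol or an unknown one.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not ManifoldCF code: treats a seed as well-formed
// exactly when the java.net.URL constructor accepts it.
final class SeedUrlCheck {

    private SeedUrlCheck() {
    }

    static boolean isWellFormed(String seed) {
        if (seed == null) {
            return false;
        }
        try {
            // Throws MalformedURLException if no protocol is given
            // or the protocol is unknown.
            new URL(seed.trim());
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] seeds = {
            "www.example.org",
            "www.example.org/",
            "http://www.example.org",
            "http://hello.world.com/startpage.html"
        };
        for (String seed : seeds) {
            System.out.println(seed + " -> " + isWellFormed(seed));
        }
    }
}
```

Under this check the two seeds without a protocol are rejected, while http://www.example.org (with or without a trailing slash) and a seed that points at a specific html file are accepted. The constructor only parses the string, so a stricter rule would need additional tests on top of it.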

I might even go a bit further.  See the following code in
WebcrawlerConnector:

protected String makeDocumentIdentifier(String parentIdentifier, String rawURL, DocumentURLFilter filter)

Thanks!
Karl



On Fri, Mar 16, 2012 at 5:52 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
> On 15.03.12 19.30, Karl Wright wrote:
>>
>> A seed can be a specific html file so complaining about a trailing
>> slash would make that not work.  For example:
>>
>> http://hello.world.com/startpage.html
>
>
> I think I was a little bit unclear in my recent email. By a trailing slash,
> I was thinking more about the domain name itself, e.g. www.example.org/.
>
> I will create a Jira ticket now, but I will only focus on well-formed
> URLs in the seeds list.
>
> Do you agree that a well-formed URL is what java.net.URL will accept in the
> constructor's argument? Then www.example.org will fail, but
> http://www.example.org (without a trailing slash) will pass.
>
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
