manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: FileSystem connector path issue
Date Tue, 19 Nov 2013 22:15:13 GMT
Hi Mark,

The filesystem connector is supposed to emulate WGET behavior.  What does
WGET do in this case?

Karl



On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <mlibucha@gmail.com> wrote:

> Noticed this problem while crawling a web site and saving to the file
> system with the FileSystem output connector.
>
> Let's say the website defines a URL like this:
>
> http://mysite/news
>
> That URI actually gets mapped to a file on the web server, say
> http://mysite/news/index.html, but the http://mysite/news URI does exist
> and gets sent as the documentURI to addOrReplaceDocument().
>
> MCF's FileSystem connector gets the http://mysite/news URL and saves that
> content under a path that looks like this: http/mysite/news, where news is
> a file.
>
> But then if the site also defines a URL like this
> http://mysite/news/local/today.html, MCF's FileSystem connector fails
> trying to create the directory http/mysite/news/local because part of it,
> http/mysite/news, already exists as a file.
>
> Of course, if the URIs are crawled in the reverse order, the file can't be
> created because a directory already exists with that name.
>
> Make sense?
>
> The real killer is that when this happens, it's fatal to the job. That is,
> it doesn't just fail to get that one URL; the connector returns a fatal
> error and the crawl is stopped.
>
> Mark
>
>
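
The collision Mark describes can be reproduced outside of MCF with a few lines of Python. This is only a sketch of the failure mode, not the connector's actual code: the helper names url_to_path and save are hypothetical, and the scheme/host/path layout is assumed from the http/mysite/news example above. Here a collision is reported per document instead of being treated as fatal.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def url_to_path(root, url):
    """Map a URL to root/<scheme>/<host>/<path segments> (assumed layout)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return Path(root, parts.scheme, parts.netloc, *segments)


def save(root, url, content):
    """Try to store content at the mapped path; return False on a collision."""
    target = url_to_path(root, url)
    try:
        # Fails if an ancestor such as http/mysite/news already exists as a file.
        target.parent.mkdir(parents=True, exist_ok=True)
        # Fails if the target itself already exists as a directory.
        target.write_bytes(content)
        return True
    except OSError as exc:
        print(f"cannot save {url}: {type(exc).__name__}")
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # "news" is written as a file first, so "news/local" cannot be created.
        save(root, "http://mysite/news", b"index page")
        save(root, "http://mysite/news/local/today.html", b"story")
    with tempfile.TemporaryDirectory() as root:
        # Reverse order: "news" is already a directory, so the file write fails.
        save(root, "http://mysite/news/local/today.html", b"story")
        save(root, "http://mysite/news", b"index page")
```

In both orderings the second save fails, which matches the two cases in the report. The exact exception type depends on the platform, so the sketch catches OSError in general.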
