Hi Mark,

Yes, at least the materials I've seen online say that this is the case, but I don't know exactly how wget does it.

For the purposes of the File System Output Connector, it doesn't matter, since anyone can construct a site that does NOT redirect and still has the URL layout as you originally described.  So the problem has to be solved.

I can experiment with WGET here to check what its behavior is, but not while I'm working on Windows, so I thought you might be able to do that.


On Tue, Nov 19, 2013 at 5:52 PM, Mark Libucha <mlibucha@gmail.com> wrote:
So you're saying wget can be run in a mode whereby it follows the redirect to fetch the content but uses the original, pre-redirect URL to create the directory in which to store the content?

On Tue, Nov 19, 2013 at 2:41 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Mark,

Yes, but I'm afraid we *can't* emulate the redirect behavior because that's an upstream connector choice.  WGet can operate in a mode where it uses the pre-redirect URL, and resolves conflicts nonetheless.  How does it do it?


On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <mlibucha@gmail.com> wrote:
wget -x uses the redirect URL as the basis for the path it creates.

So, if http://mysite/news returns a 302 redirecting to http://mysite/news/index.html, wget saves as:


MCF, on the other hand, saves as:



On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Mark,

The filesystem connector is supposed to emulate WGET behavior.  What does WGET do in this case?


On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <mlibucha@gmail.com> wrote:
Noticed this problem while crawling a web site and saving to the file system with the FileSystem output connector.

Let's say the website defines a URL like this:


That URI actually gets mapped to a file on the web server, say http://mysite/news/index.html, but the http://mysite/news URI does exist and gets sent as the documentURI to addOrReplaceDocument().

MCF's FileSystem connector gets the http://mysite/news URL and creates a path for saving that content that looks like this: http/mysite/news, where news is a file.

But then if the site also defines a URL like http://mysite/news/local/today.html, MCF's FileSystem connector fails when it tries to create the directory http/mysite/news/local, because part of that path, http/mysite/news, already exists as a file.

Of course, if the URIs are crawled in the reverse order, the file can't be created because a directory already exists with that name.
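
To make the collision concrete, here is a minimal sketch in Python. It is not MCF's actual code: uri_to_path, save, and crawl are hypothetical helpers, and the scheme/host/path layout is only assumed from the paths described above. It saves the two URIs in either order into a scratch directory and records which one fails.

```python
import os
import tempfile
from urllib.parse import urlparse


def uri_to_path(root, uri):
    """Map a URI to root/scheme/host/path (assumed layout, not MCF's code)."""
    parts = urlparse(uri)
    segments = [s for s in parts.path.split("/") if s]
    return os.path.join(root, parts.scheme, parts.netloc, *segments)


def save(root, uri, content):
    """Create the parent directories, then write the content as a file."""
    target = uri_to_path(root, uri)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as handle:
        handle.write(content)
    return target


def crawl(uris):
    """Save each URI in order into a fresh temp dir; return (uri, outcome)."""
    results = []
    with tempfile.TemporaryDirectory() as root:
        for uri in uris:
            try:
                save(root, uri, b"content")
                results.append((uri, "saved"))
            except OSError:
                # A file sits where a directory is needed, or the reverse.
                results.append((uri, "failed"))
    return results


if __name__ == "__main__":
    news = "http://mysite/news"
    today = "http://mysite/news/local/today.html"
    print(crawl([news, today]))
    print(crawl([today, news]))
```

In both orders the first URI is saved and the second one fails; only the failing operation differs (creating the directory in one case, creating the file in the other).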

Make sense?

The real killer is that when this happens it's fatal to the job. That is, it doesn't just fail to get that one URL; the connector returns a fatal error and the crawl is stopped.