So you're saying wget can be run in a mode whereby it follows the redirect to fetch the content but uses the original, pre-redirect url to create the directory to store the content?
On Tue, Nov 19, 2013 at 2:41 PM, Karl Wright <firstname.lastname@example.org> wrote:

Hi Mark,

Yes, but I'm afraid we *can't* emulate the redirect behavior, because that's an upstream connector choice. WGet can operate in a mode where it uses the pre-redirect URL, and resolves conflicts nonetheless. How does it do it?
Karl

On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <email@example.com> wrote:
wget -x uses the redirect URL as the basis for the path it creates. So, if http://mysite/news returns a 302 redirecting to http://mysite/news/index.html, wget saves as:

mysite/news/index.html

MCF, on the other hand, saves as:

mysite/news

Mark
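To make the difference concrete, here is a minimal Python sketch of the two path mappings. The `url_to_path` helper is hypothetical (it is neither wget's nor MCF's actual code); it just maps host + URL path to a relative filesystem path, wget-style, appending index.html for directory-like URLs. Feeding it the redirect target versus the pre-redirect documentURI shows why one layout collides and the other doesn't:

```python
from urllib.parse import urlparse

def url_to_path(url):
    """Map a URL to a wget -x style relative path: host followed by the
    URL path, with index.html appended when the path ends in a slash.
    Hypothetical helper for illustration only."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return parts.netloc + path

# Using the redirect target, as wget -x does:
print(url_to_path("http://mysite/news/index.html"))  # mysite/news/index.html

# Using the pre-redirect documentURI, as MCF does:
print(url_to_path("http://mysite/news"))             # mysite/news
```

In the first layout, "news" is a directory, so a later URL like http://mysite/news/local/today.html fits underneath it; in the second, "news" is a plain file, which is exactly the conflict described below.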
On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <firstname.lastname@example.org> wrote:
Hi Mark,

The filesystem connector is supposed to emulate WGET behavior. What does WGET do in this case?
Karl

On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <email@example.com> wrote:
Noticed this problem while crawling a web site and saving to the file system with the FileSystem output connector.

Let's say the website defines a URL like this:

http://mysite/news

That URI actually gets mapped to a file on the web server, say http://mysite/news/index.html, but the http://mysite/news URI does exist and gets sent as the documentURI to addOrReplaceDocument().

MCF's FileSystem connector gets the http://mysite/news URL and creates a path for saving that content that looks like this: http/mysite/news, where news is a file.

But then, if the site also defines a URL like http://mysite/news/local/today.html, MCF's FileSystem connector fails trying to create the directory http/mysite/news/local, because part of it, http/mysite/news, already exists as a file.
Of course, if the URIs are crawled in the reverse order, the file can't be created because a directory already exists with that name. Make sense?

The real killer is that when this happens, it's fatal to the job. That is, it doesn't just fail to get that one URL; the connector returns a fatal error and the crawl is stopped.

Mark
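The failure mode Mark describes can be reproduced outside MCF with a few lines of Python (illustrative only, not MCF's Java code; the directory names mirror the example URLs in the thread). Once http://mysite/news has been saved as a plain file named news, any attempt to create the directory http/mysite/news/local fails at the OS level:

```python
import os
import tempfile

# Minimal reproduction of the file-vs-directory collision (illustrative
# sketch; the output root and paths are assumptions based on the thread).
root = tempfile.mkdtemp()

# 1. http://mysite/news gets saved as a *file* named "news".
os.makedirs(os.path.join(root, "http", "mysite"))
with open(os.path.join(root, "http", "mysite", "news"), "w") as f:
    f.write("content fetched for http://mysite/news")

# 2. http://mysite/news/local/today.html then needs "news" to be a
#    directory, and the OS refuses.
error = None
try:
    os.makedirs(os.path.join(root, "http", "mysite", "news", "local"))
except OSError as exc:  # NotADirectoryError on POSIX systems
    error = exc
print("collision:", error)
```

In the reverse crawl order, the same thing happens the other way around: with news already created as a directory, opening it for writing as a file raises IsADirectoryError on POSIX. Either way, the collision surfaces as an OSError, which the connector currently treats as fatal to the whole job.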