manifoldcf-user mailing list archives

From: Mark Libucha <mlibu...@gmail.com>
Subject: Re: FileSystem connector path issue
Date: Tue, 19 Nov 2013 22:52:48 GMT
So you're saying wget can be run in a mode whereby it follows the redirect
to fetch the content but uses the original, pre-redirect URL to create the
directory to store the content?


On Tue, Nov 19, 2013 at 2:41 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Mark,
>
> Yes, but I'm afraid we *can't* emulate the redirect behavior because
> that's an upstream connector choice.  WGet can operate in a mode where it
> uses the pre-redirect URL, and resolves conflicts nonetheless.  How does it
> do it?
>
> Karl
>
>
>
> On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>
>> wget -x uses the redirect URL as the basis for the path it creates.
>>
>> So, if http://mysite/news returns a 302 redirecting to
>> http://mysite/news/index.html, wget saves as:
>>
>> mysite/news/index.html
>>
>> MCF, on the other hand, saves as:
>>
>> http/mysite/news
>>
>> Mark
>>
>>
>> On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Mark,
>>>
>>> The filesystem connector is supposed to emulate WGET behavior.  What
>>> does WGET do in this case?
>>>
>>> Karl
>>>
>>>
>>>
>>> On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>>
>>>> Noticed this problem while crawling a web site and saving to the file
>>>> system with the FileSystem output connector.
>>>>
>>>> Let's say the website defines a URL like this:
>>>>
>>>> http://mysite/news
>>>>
>>>> That URI actually gets mapped to a file on the web server, say
>>>> http://mysite/news/index.html, but the http://mysite/news URI does
>>>> exist and gets sent as the documentURI to addOrReplaceDocument().
>>>>
>>>> MCF's FileSystem connector gets the http://mysite/news URL and creates
>>>> a path for saving that content that looks like this: http/mysite/news,
>>>> where news is a file.
>>>>
>>>> But then if the site also defines a URL like this
>>>> http://mysite/news/local/today.html, MCF's FileSystem connector fails
>>>> trying to create the directory http/mysite/news/local because part of it,
>>>> http/mysite/news, already exists as a file.
>>>>
>>>> Of course, if the URIs are crawled in the reverse order, the file can't
>>>> be created because a directory already exists with that name.
>>>>
>>>> Make sense?
>>>>
>>>> The real killer is that when this happens it's fatal to the job. That
>>>> is, it doesn't just fail to get that one URL; the connector returns a
>>>> fatal error and the crawl is stopped.
>>>>
>>>> Mark
>>>>
>>>>
>>>
>>
>
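
The file-versus-directory clash described in the thread can be reproduced outside MCF. The following is a minimal Python sketch, not the FileSystem connector's code and not a description of how wget resolves the conflict. The helper names (url_to_relpath, save_naive, save_tolerant) and the index.html fallback name are assumptions made for illustration. save_naive mirrors the scheme/host/path layout from the thread and fails when a path segment is already taken by a file; save_tolerant shows one possible workaround, which moves the blocking file into a directory of the same name.

```python
import os
import tempfile
from urllib.parse import urlsplit


def url_to_relpath(url):
    """Map a URL to scheme/host/path, e.g. http/mysite/news (hypothetical helper).

    Query strings and fragments are ignored in this sketch.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return os.path.join(parts.scheme, parts.netloc, *segments)


def save_naive(root, url, content):
    """Write content at the mapped path; raises OSError on a file/directory clash."""
    target = os.path.join(root, url_to_relpath(url))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as fh:
        fh.write(content)
    return target


def save_tolerant(root, url, content, index_name="index.html"):
    """Assumed workaround: turn a blocking file into <name>/<index_name>."""
    target = os.path.join(root, url_to_relpath(url))
    rel = os.path.relpath(os.path.dirname(target), root)
    current = root
    for segment in rel.split(os.sep):
        current = os.path.join(current, segment)
        if os.path.isfile(current):
            # A file occupies a path we need as a directory: move it inside.
            tmp = current + ".tmp-move"
            os.replace(current, tmp)
            os.mkdir(current)
            os.replace(tmp, os.path.join(current, index_name))
        elif not os.path.isdir(current):
            os.mkdir(current)
    if os.path.isdir(target):
        # The reverse case: a directory already has the name we want.
        target = os.path.join(target, index_name)
    with open(target, "wb") as fh:
        fh.write(content)
    return target
```

Calling save_naive for http://mysite/news and then for http://mysite/news/local/today.html under the same root raises an OSError on the second call, which corresponds to the failure reported above. With save_tolerant, both crawl orders leave the /news content at http/mysite/news/index.html. Whether a connector should rewrite paths this way, or simply skip the conflicting document instead of aborting the job, is a design choice this sketch does not settle.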
