Mailing-List: contact user-help@manifoldcf.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@manifoldcf.apache.org
Received-SPF: pass (nike.apache.org: domain of daddywri@gmail.com designates
 209.85.128.51 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAMEqr29wknM7+4h4cymUAE-MTPdqLww7y8Q73GFgmow-dYqfGQ@mail.gmail.com>
References: 
 <CAMEqr2_uXMwmw0JLnPZBgF2zuLHg9iV71mssUbstYiT2Na_Jyg@mail.gmail.com>
	<CALUFAGDhyPej9D7k-2upk43yvw40bywe4Rc9N2PxPy+iQx+9Pg@mail.gmail.com>
	<CAMEqr29HN7opP=-+BKniBmJL6Y+1oU3=x-WcDbEp9oju7yV1qQ@mail.gmail.com>
	<CALUFAGB6Vp-q6DMMHVS-c-FWpPX9ZZY5MNF8vH1AN4vG_8akcA@mail.gmail.com>
	<CAMEqr29wknM7+4h4cymUAE-MTPdqLww7y8Q73GFgmow-dYqfGQ@mail.gmail.com>
Date: Tue, 19 Nov 2013 17:57:43 -0500
Message-ID: 
 <CALUFAGDCKtfuWKY57FPsqTPj97dNYFaPc4cSDznppSVutRNFDw@mail.gmail.com>
Subject: Re: FileSystem connector path issue
From: Karl Wright <daddywri@gmail.com>
To: "user@manifoldcf.apache.org" <user@manifoldcf.apache.org>
Content-Type: multipart/alternative; boundary=001a11c339e8a6606204eb8f972d

--001a11c339e8a6606204eb8f972d
Content-Type: text/plain; charset=ISO-8859-1

Hi Mark,

Yes, at least the materials I see online say that this is the case.  But I
don't know exactly how.

For the purposes of the File System Output Connector, it doesn't matter,
since anyone can construct a site that does NOT redirect and still has the
URL layout as you originally described.  So the problem has to be solved.
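
To make the layout problem concrete, here is a rough sketch (Python, not
MCF code) that maps URLs to scheme/host/path the way this thread describes
and reports any path that would have to be both a file and a directory.
The function names url_to_path and find_conflicts are made up for
illustration, and the mapping is an assumption based on the http/mysite/news
example, not the connector's actual implementation.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_path(url):
    """Map a URL to scheme/host/path segments (assumed layout from this thread)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return PurePosixPath(parts.scheme, parts.netloc, *segments)


def find_conflicts(urls):
    """Return (file_path, deeper_path) pairs where file_path must also be a directory."""
    paths = [url_to_path(u) for u in urls]
    conflicts = []
    for p in paths:
        for q in paths:
            # p is a document path, but q needs p to exist as a directory.
            if p != q and p in q.parents:
                conflicts.append((str(p), str(q)))
    return conflicts


urls = ["http://mysite/news", "http://mysite/news/local/today.html"]
print(find_conflicts(urls))
# [('http/mysite/news', 'http/mysite/news/local/today.html')]
```

The check does not depend on crawl order, which matches the point that
either order fails: only the error changes, not the underlying clash.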

I can experiment with WGET here, to check out what its behavior might be,
but not while I'm doing Windows stuff - so I thought you might be able to
do that.

Thanks,
Karl


On Tue, Nov 19, 2013 at 5:52 PM, Mark Libucha <mlibucha@gmail.com> wrote:

> So you're saying wget can be run in a mode whereby it follows the redirect
> to fetch the content but uses the original, pre-redirect url to create the
> directory to store the content?
>
>
> On Tue, Nov 19, 2013 at 2:41 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Mark,
>>
>> Yes, but I'm afraid we *can't* emulate the redirect behavior because
>> that's an upstream connector choice.  WGet can operate in a mode where it
>> uses the pre-redirect URL, and resolves conflicts nonetheless.  How does it
>> do it?
>>
>> Karl
>>
>>
>>
>> On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>
>>> wget -x uses the redirect url as the basis for the path it creates.
>>>
>>> So, if http://mysite/news returns a 302 redirecting to
>>> http://mysite/news/index.html, wget saves as:
>>>
>>> mysite/news/index.html
>>>
>>> MCF, on the other hand, saves as:
>>>
>>> http/mysite/news
>>>
>>> Mark
>>>
>>>
>>> On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> The filesystem connector is supposed to emulate WGET behavior.  What
>>>> does WGET do in this case?
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <mlibucha@gmail.com> wrote:
>>>>
>>>>> Noticed this problem while crawling a web site and saving to the file
>>>>> system with the FileSystem output connector.
>>>>>
>>>>> Let's say the website defines a URL like this:
>>>>>
>>>>> http://mysite/news
>>>>>
>>>>> That URI actually gets mapped to a file on the web server, say
>>>>> http://mysite/news/index.html, but the http://mysite/news URI does
>>>>> exist and gets sent as the documentURI to addOrReplaceDocument().
>>>>>
>>>>> MCF's FileSystem connector gets the http://mysite/news URL and
>>>>> saves that content at a path that looks like this:
>>>>> http/mysite/news, where news is a file.
>>>>>
>>>>> But then if the site also defines a URL like this
>>>>> http://mysite/news/local/today.html, MCF's FileSystem connector fails
>>>>> trying to create the directory http/mysite/news/local because part of it,
>>>>> http/mysite/news, already exists as a file.
>>>>>
>>>>> Of course, if the URIs are crawled in the reverse order, the file
>>>>> can't be created because a directory already exists with that name.
>>>>>
>>>>> Make sense?
>>>>>
>>>>> The real killer is that when this happens, it's fatal to the job.
>>>>> That is, it doesn't just fail to get that one URL; the connector
>>>>> returns a fatal error and the crawl is stopped.
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>
>>>
>>
>
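
One possible way to resolve the clash in the original report quoted above,
sketched in Python under loud assumptions: when a path has to be both a
document and a directory, keep the directory and store the document inside
it under a reserved file name. The name index.content and the function
save_document are hypothetical. This is neither what the FileSystem
connector does nor a claim about how wget behaves.

```python
import os

RESERVED_NAME = "index.content"  # hypothetical reserved name, not an MCF or wget convention


def save_document(root, rel_path, data):
    """Write data under root/rel_path, resolving file-vs-directory clashes.

    root must already exist; rel_path uses '/' separators, e.g. 'http/mysite/news'.
    Returns the path the data was actually written to.
    """
    segments = rel_path.split("/")
    current = root
    for segment in segments[:-1]:
        current = os.path.join(current, segment)
        if os.path.isfile(current):
            # An earlier document occupies this name: turn it into a directory
            # and move the old content to the reserved name inside it.
            moved = current + ".tmp-move"
            os.rename(current, moved)
            os.mkdir(current)
            os.rename(moved, os.path.join(current, RESERVED_NAME))
        elif not os.path.isdir(current):
            os.mkdir(current)
    target = os.path.join(current, segments[-1])
    if os.path.isdir(target):
        # A deeper URL was saved first: store this document inside the directory.
        target = os.path.join(target, RESERVED_NAME)
    with open(target, "wb") as handle:
        handle.write(data)
    return target
```

With this approach both crawl orders end with the same layout, so neither
order has to produce an error for the job. A URL whose last segment is the
reserved name would still collide, so a real fix would need to handle that
case as well.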

--001a11c339e8a6606204eb8f972d--