From: Karl Wright
To: "user@manifoldcf.apache.org"
Date: Tue, 19 Nov 2013 17:57:43 -0500
Subject: Re: FileSystem connector path issue

Hi Mark,

Yes, at least the materials I see online say that this is the case. But I don't know exactly how.

For the purposes of the File System Output Connector, it doesn't matter, since anyone can construct a site that does NOT redirect and still has the URL layout as you originally described. So the problem has to be solved.

I can experiment with WGET here, to check out what its behavior might be, but not while I'm doing Windows stuff - so I thought you might be able to do that.

Thanks,
Karl
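
One possible way to avoid the conflict described in the quoted thread below is sketched here. It is only an illustration, not what MCF or wget does: every URL segment becomes a directory, and the content is written to a reserved leaf file inside it. The name `_content` and the helpers `leaf_safe_path` and `save` are hypothetical. Query strings, ports, and URLs whose own segments equal the reserved name are not handled.

```python
import os
import tempfile
from urllib.parse import urlsplit

RESERVED_LEAF = "_content"  # hypothetical reserved file name, not an MCF setting


def leaf_safe_path(url: str) -> str:
    """Map a URL to a relative path whose last component is always RESERVED_LEAF."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.scheme, parts.netloc, *segments, RESERVED_LEAF])


def save(root: str, url: str, body: bytes) -> str:
    """Write body under root; every URL segment becomes a directory."""
    target = os.path.join(root, *leaf_safe_path(url).split("/"))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as fh:
        fh.write(body)
    return target


def demo() -> list:
    """Save both URLs from the thread and list the files that were written."""
    with tempfile.TemporaryDirectory() as root:
        save(root, "http://mysite/news", b"news page")
        save(root, "http://mysite/news/local/today.html", b"today")
        found = []
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                found.append(rel.replace(os.sep, "/"))
        return sorted(found)
```

Because no document is ever stored at a path that another document needs as a directory, the crawl order of the two URLs no longer matters in this sketch. The cost is that the saved tree no longer mirrors the wget layout.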



On Tue, Nov 19, 2013 at 5:52 PM, Mark Libucha <mlibucha@gmail.com> wrote:
So you're saying wget can be run in a mode whereby it follows the redirect to fetch the content but uses the original, pre-redirect url to create the directory to store the content?


On Tue, Nov 19, 2013 at 2:41 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Mark,

Yes, but I'm afraid we *can't* emulate the redirect behavior because that's an upstream connector choice. WGet can operate in a mode where it uses the pre-redirect URL, and resolves conflicts nonetheless. How does it do it?

Karl


On Tue, Nov 19, 2013 at 5:33 PM, Mark Libucha <mlibucha@gmail.com> wrote:
wget -x uses the redirect url as the basis for the path it creates.

So, if http://mysite/news returns a 302 redirecting to http://mysite/news/index.html, wget saves as:

mysite/news/index.html

MCF, on the other hand, saves as:

http/mysite/news

Mark
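
The two layouts reported above can be reproduced with a small sketch. These functions are illustrative only and are not the actual code of MCF or wget. The names `mcf_style_path` and `wget_x_style_path` are hypothetical, and the redirect itself is assumed to have been followed already, so the second function takes the post-redirect URL.

```python
from urllib.parse import urlsplit


def mcf_style_path(document_uri: str) -> str:
    """Scheme, host, then path segments, as in the thread's http/mysite/news."""
    parts = urlsplit(document_uri)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.scheme, parts.netloc, *segments])


def wget_x_style_path(final_url: str) -> str:
    """Host, then path segments of the post-redirect URL (mysite/news/index.html)."""
    parts = urlsplit(final_url)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.netloc, *segments])
```

With these, `mcf_style_path("http://mysite/news")` gives `http/mysite/news`, while `wget_x_style_path("http://mysite/news/index.html")` gives `mysite/news/index.html`. The difference comes from which URL is used (pre- or post-redirect) and whether the scheme becomes a directory.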


On Tue, Nov 19, 2013 at 2:15 PM, Karl Wright <daddywri@gmail.com> wrote:
Hi Mark,

The filesystem connector is supposed to emulate WGET behavior. What does WGET do in this case?

Karl


On Tue, Nov 19, 2013 at 4:17 PM, Mark Libucha <mlibucha@gmail.com> wrote:
Noticed this problem while crawling a web site and saving to the file system with the FileSystem output connector.

Let's say the website defines a URL like this:

http://mysite/news

That URI actually gets mapped to a file on the web server, say http://mysite/news/index.html, but the http://mysite/news URI does exist and gets sent as the documentURI to addOrReplaceDocument().

MCF's FileSystem connector gets the http://mysite/news URL and creates a directory for saving that content that looks like this http/mysite/news, where news is a file.

But then if the site also defines a URL like this http://mysite/news/local/today.html, MCF's FileSystem connector fails trying to create the directory http/mysite/news/local because part of it, http/mysite/news, already exists as a file.

Of course, if the URIs are crawled in the reverse order, the file can't be created because a directory already exists with that name.

Make sense?

The real killer is that when this happens it's fatal to the job. That is, it doesn't just fail to get that one URL, the connector returns a fatal error and the crawl is stopped.

Mark
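
The file/directory collision described in this message can be reproduced outside MCF with a short sketch. It uses only the Python standard library and a temporary directory. The function names are hypothetical, and the exact exception subclass raised depends on the platform, so the sketch reports only that an `OSError` occurred.

```python
import os
import tempfile


def try_file_then_dir(root: str) -> str:
    """Save http/mysite/news as a file, then try to create http/mysite/news/local."""
    news = os.path.join(root, "http", "mysite", "news")
    os.makedirs(os.path.dirname(news))
    with open(news, "w") as fh:
        fh.write("news page")
    try:
        os.makedirs(os.path.join(news, "local"))
    except OSError as exc:
        return type(exc).__name__  # 'news' is a file, so it cannot hold 'local'
    return "ok"


def try_dir_then_file(root: str) -> str:
    """Create http/mysite/news/local first, then try to write http/mysite/news."""
    local = os.path.join(root, "http", "mysite", "news", "local")
    os.makedirs(local)
    try:
        with open(os.path.dirname(local), "w") as fh:
            fh.write("news page")
    except OSError as exc:
        return type(exc).__name__  # 'news' is a directory, so it cannot be opened as a file
    return "ok"


def demo() -> tuple:
    """Run both crawl orders in separate temporary directories."""
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        return try_file_then_dir(a), try_dir_then_file(b)
```

Both orders fail, which matches the report. The sketch catches the error per document and carries on, which is one way a single bad path could be recorded without aborting everything else; whether that is appropriate for the connector is a separate design question.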





