From: Karl Wright
Date: Sun, 18 Nov 2012 00:07:35 -0800
Subject: RE: Anyone out there using RSS connector, who wants to help?
To: Ahmet Arslan, dev@manifoldcf.apache.org

Odd. The problem is obviously the port of -1. But the code does not
attach a specific port to the URL in that case. I will try your example
exactly when I have access to the internet again.

Karl

Sent from my Windows Phone

From: Ahmet Arslan
Sent: 11/17/2012 4:47 PM
To: dev@manifoldcf.apache.org
Subject: Re: Anyone out there using RSS connector, who wants to help?

Hi,

Regarding

"WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1"

I see that http://www.milliyet.com.tr/robots.txt exists.

Ahmet

--- On Sat, 11/17/12, Ahmet Arslan wrote:

> From: Ahmet Arslan
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> To: dev@manifoldcf.apache.org
> Date: Saturday, November 17, 2012, 11:11 PM
>
> Hi Karl,
>
> Never used rss connector. But here is what I have done.
>
> I defined a job to crawl using mcf-trunk. mcf-trunk crawled the
> following two URLs:
>
> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>
> With the CONNECTORS-120 branch I can crawl
>
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>
> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives a
> status of "Error: Repeated service interruptions - failure
> getting document version"
>
> I see these in the log file:
>
> WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>
> By the way, in the "Dechromed Content" tab (Job Setting UI) I see
> four " "
>
> Thanks,
> Ahmet
>
> --- On Fri, 11/16/12, Karl Wright wrote:
>
> > From: Karl Wright
> > Subject: Anyone out there using RSS connector, who wants to help?
> > To: "dev"
> > Date: Friday, November 16, 2012, 3:54 PM
> >
> > Hi all,
> >
> > The branch
> > https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
> > contains an RSS connector that has been updated to use
> > httpcomponents 4.2.2. I'd love for people who are in a position
> > to do significant RSS crawling to try it out before I pull it
> > into trunk. Any takers?
> >
> > Karl
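
A note on the ":-1" in the warning quoted above: in Java, java.net.URI.getPort() (like java.net.URL.getPort()) returns -1 when a URL carries no explicit port. The sketch below shows one way such a value could end up in a robots.txt URL, and one way to guard against it. It is an illustration only. RobotsUrlSketch, naiveRobotsUrl, and safeRobotsUrl are hypothetical names, and nothing here is taken from the RSS connector's code, which per Karl does not attach a port in this case.

```java
import java.net.URI;

public class RobotsUrlSketch {

    // Hypothetical naive helper: always appends the port, even when the
    // source URL has none (getPort() then returns -1).
    public static String naiveRobotsUrl(String pageUrl) {
        URI u = URI.create(pageUrl);
        return u.getScheme() + "://" + u.getHost() + ":" + u.getPort() + "/robots.txt";
    }

    // Hypothetical guarded helper: appends the port only when one was given.
    public static String safeRobotsUrl(String pageUrl) {
        URI u = URI.create(pageUrl);
        int port = u.getPort();
        String authority = (port == -1) ? u.getHost() : u.getHost() + ":" + port;
        return u.getScheme() + "://" + authority + "/robots.txt";
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        System.out.println(naiveRobotsUrl(feed)); // contains ":-1"
        System.out.println(safeRobotsUrl(feed));  // no port component
    }
}
```

Whether the CONNECTORS-120 branch builds the robots.txt URL along these lines is not established in this thread; the sketch only shows where a -1 port value can originate.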