manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject RE: Anyone out there using RSS connector, who wants to help?
Date Sun, 18 Nov 2012 08:07:35 GMT
Odd. The problem is obviously the port of -1. But the code does not
attach a specific port to the URL in that case.

I will try your example exactly when I have access to internet again.

Karl
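
For readers following the thread: the -1 discussed above matches the value that java.net.URI.getPort() and java.net.URL.getPort() return when a URL carries no explicit port. The sketch below is only an illustration of how such a value could end up in a printed host string. It is not the connector's code. The class and method names (RobotsUrlSketch, naiveBase, guardedBase) are hypothetical, and the actual cause in the CONNECTORS-120 branch is not established in this thread.

```java
import java.net.URI;

// Hypothetical illustration; not taken from the ManifoldCF source.
public class RobotsUrlSketch {

    // Naive variant: always appends the port, even when the URL has none.
    public static String naiveBase(String url) {
        URI u = URI.create(url);
        return u.getScheme() + "://" + u.getHost() + ":" + u.getPort();
    }

    // Guarded variant: appends the port only when one was given explicitly.
    public static String guardedBase(String url) {
        URI u = URI.create(url);
        int port = u.getPort(); // -1 means the URL did not specify a port
        String base = u.getScheme() + "://" + u.getHost();
        return port == -1 ? base : base + ":" + port;
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        System.out.println(naiveBase(feed));
        System.out.println(guardedBase(feed) + "/robots.txt");
    }
}
```

With the feed URL from the report, the naive variant yields the same "http://www.milliyet.com.tr:-1" string that appears in the warning, while the guarded variant leaves the port off so the default for the scheme applies.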

Sent from my Windows Phone
From: Ahmet Arslan
Sent: 11/17/2012 4:47 PM
To: dev@manifoldcf.apache.org
Subject: Re: Anyone out there using RSS connector, who wants to help?
Hi,

Regarding  "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
Pre-ingest service interruption reported for job 1353185325276
connection 'rss': Couldn't fetch robots.txt from
http://www.milliyet.com.tr:-1"

I see that http://www.milliyet.com.tr/robots.txt exists.

Ahmet

--- On Sat, 11/17/12, Ahmet Arslan <iorixxx@yahoo.com> wrote:

> From: Ahmet Arslan <iorixxx@yahoo.com>
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> To: dev@manifoldcf.apache.org
> Date: Saturday, November 17, 2012, 11:11 PM
> Hi Karl,
>
> I have never used the RSS connector, but here is what I did.
>
> I defined a job to crawl using mcf-trunk. mcf-trunk crawled
> the following two URLs:
>
> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>
> With CONNECTORS-120 branch I can crawl
>
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>
> but  http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
> status of "Error: Repeated service interruptions - failure
> getting document version"
>
> I see these in the log file :
>
>  WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>  WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>
>
> By the way, in the "Dechromed Content" tab (Job Setting UI) I see
> four "&nbsp;"
>
> Thanks,
> Ahmet
> --- On Fri, 11/16/12, Karl Wright <daddywri@gmail.com> wrote:
>
> > From: Karl Wright <daddywri@gmail.com>
> > Subject: Anyone out there using RSS connector, who wants to help?
> > To: "dev" <dev@manifoldcf.apache.org>
> > Date: Friday, November 16, 2012, 3:54 PM
> > Hi all,
> >
> > The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
> > contains an RSS connector that has been updated to use httpcomponents
> > 4.2.2.  I'd love for people who are in a position to do significant
> > RSS crawling to try it out before I pull it into trunk.  Any takers?
> >
> > Karl
> >
>
