manifoldcf-dev mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: Anyone out there using RSS connector, who wants to help?
Date Sat, 17 Nov 2012 21:11:57 GMT
Hi Karl,

I have never used the RSS connector, but here is what I did.

I defined a crawl job using mcf-trunk. mcf-trunk crawled both of the following URLs:

http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
http://rss.hurriyet.com.tr/rss.aspx?sectionId=2

With the CONNECTORS-120 branch I can crawl

http://rss.hurriyet.com.tr/rss.aspx?sectionId=2

but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives a status of
"Error: Repeated service interruptions - failure getting document version".

I see these in the log file:

 WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
 WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
	at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
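
A side note on the ":-1" at the end of the robots.txt URL in the WARN lines: -1 is the value java.net.URL.getPort() returns when a URL has no explicit port. I have not looked at the branch code, so treat this only as a guess that the default port is not being filled in somewhere. The sketch below shows just the JDK behavior; the class and method names (PortCheck, parse, rawPort, effectivePort) are made up for illustration and are not ManifoldCF code.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper, not part of ManifoldCF.
class PortCheck {
    // Parse a URL string, turning the checked exception into an unchecked one.
    static URL parse(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Port exactly as written in the URL; -1 when none is given.
    static int rawPort(String spec) {
        return parse(spec).getPort();
    }

    // Port to actually connect to: the explicit port, else the protocol default.
    static int effectivePort(String spec) {
        URL url = parse(spec);
        int port = url.getPort();
        return port == -1 ? url.getDefaultPort() : port;
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        System.out.println(rawPort(feed));       // no explicit port in the URL
        System.out.println(effectivePort(feed)); // falls back to the http default
    }
}
```

For the milliyet URL this prints -1 and then 80, so a host string built from the raw port would look like the one in the log.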


By the way, in the "Dechromed Content" tab (Job Setting UI) I see four "&nbsp;" strings.

Thanks,
Ahmet
--- On Fri, 11/16/12, Karl Wright <daddywri@gmail.com> wrote:

> From: Karl Wright <daddywri@gmail.com>
> Subject: Anyone out there using RSS connector, who wants to help?
> To: "dev" <dev@manifoldcf.apache.org>
> Date: Friday, November 16, 2012, 3:54 PM
> Hi all,
> 
> The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
> contains an RSS connector that has been updated to use httpcomponents
> 4.2.2.  I'd love for people who are in a position to do significant
> RSS crawling to try it out before I pull it into trunk.  Any takers?
> 
> Karl
> 
