manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Anyone out there using RSS connector, who wants to help?
Date Sun, 18 Nov 2012 18:04:04 GMT
Hi Ahmet,

I tried your example, and it worked fine here.  Here's
part of the simple history:

>>>>>>
11-18-2012 12:59:52.182	document ingest (null)	http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gundem/gundemdetay/18.11.2012/1628733/default.htm	OK	16307	1
11-18-2012 12:59:47.482	document ingest (null)	http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/gundemdetay/18.11.2012/1628657/default.htm	OK	10573	1
11-18-2012 12:59:47.133	fetch	http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gundem/gundemdetay/18.11.2012/1628733/default.htm	200	16307	5050
11-18-2012 12:59:42.133	fetch	http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/gundemdetay/18.11.2012/1628657/default.htm	200	10573	5340
11-18-2012 12:59:42.092	document ingest (null)	http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyilesme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm	OK	10212	1
11-18-2012 12:59:37.252	document ingest (null)	http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasinin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm	OK	16105	1
11-18-2012 12:59:37.133	fetch	http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyilesme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm	200	10212	4950
11-18-2012 12:59:32.332	document ingest (null)	http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gundem/gundemdetay/18.11.2012/1628801/default.htm	OK	10170	1
11-18-2012 12:59:32.133	fetch	http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasinin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm	200	16105	5110
11-18-2012 12:59:27.142	document ingest (null)	http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gundemdetay/18.11.2012/1628661/default.htm	OK	10102	1
11-18-2012 12:59:27.133	fetch	http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gundem/gundemdetay/18.11.2012/1628801/default.htm	200	10170	5200
11-18-2012 12:59:22.182	document ingest (null)	http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/gundemdetay/18.11.2012/1628824/default.htm	OK	10217	1
11-18-2012 12:59:22.133	fetch	http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gundemdetay/18.11.2012/1628661/default.htm	200	10102	4990
11-18-2012 12:59:18.062	document ingest (null)	http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem/gundemdetay/18.11.2012/1628856/default.htm	OK	9721	1
11-18-2012 12:59:17.133	fetch	http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/gundemdetay/18.11.2012/1628824/default.htm	200	10217	5050
11-18-2012 12:59:12.452	document ingest (null)	http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-reddetti/gundem/gundemdetay/18.11.2012/1628795/default.htm	OK	11412	1
11-18-2012 12:59:12.133	fetch	http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem/gundemdetay/18.11.2012/1628856/default.htm	200	9721	5930
11-18-2012 12:59:07.133	fetch	http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-reddetti/gundem/gundemdetay/18.11.2012/1628795/default.htm	200	11412	5300
11-18-2012 12:59:06.892	document ingest (null)	http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/gundemdetay/17.11.2012/1628402/default.htm	OK	11183	1
11-18-2012 12:59:02.772	document ingest (null)	http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi/gundem/gundemdetay/18.11.2012/1628740/default.htm	OK	10632	1
11-18-2012 12:59:02.153	fetch	http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/gundemdetay/17.11.2012/1628402/default.htm	200	11183	4720
11-18-2012 12:58:57.173	fetch	http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi/gundem/gundemdetay/18.11.2012/1628740/default.htm	200	10632	5570
11-18-2012 12:58:52.533	robots parse	www.hurriyet.com.tr	SUCCESS	0	78
11-18-2012 12:58:52.511	robots parse	gundem.milliyet.com.tr	SUCCESS	0	70
11-18-2012 12:58:52.136	fetch	http://www.hurriyet.com.tr/robots.txt	200	928	476
11-18-2012 12:58:52.129	fetch	http://gundem.milliyet.com.tr/robots.txt	200	797	453
11-18-2012 12:58:49.013	fetch	http://rss.hurriyet.com.tr/rss.aspx?sectionId=2	200	34467	1080
11-18-2012 12:58:48.993	fetch	http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml	200	72439	1510
11-18-2012 12:58:44.513	robots parse	www.milliyet.com.tr	SUCCESS	0	340
11-18-2012 12:58:44.013	fetch	http://rss.hurriyet.com.tr/robots.txt	404	4096	770
11-18-2012 12:58:44.013	fetch	http://www.milliyet.com.tr/robots.txt	200	17484	840
11-18-2012 12:58:41.502	job start	1353261469661(rss)		0	1
<<<<<<

So it looks like there is an http://www.milliyet.com.tr/robots.txt,
which it fetched fine, and there is no
http://rss.hurriyet.com.tr/robots.txt (that fetch returned a 404).
Does this seem correct to you?  Furthermore, the feeds point at content
that requires access to (and robots fetches for) two other servers:
gundem.milliyet.com.tr and www.hurriyet.com.tr.
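
For what it's worth, a port of -1 is what java.net.URI reports when a
URL names no explicit port, so the "http://www.milliyet.com.tr:-1" in
the warning quoted below would be consistent with that value being
appended without a check.  That is only a guess at the cause, not
something confirmed in the connector code.  A minimal sketch of building
a robots.txt URL that skips the -1 case (RobotsUrlSketch and
robotsUrlFor are hypothetical names for illustration, not ManifoldCF
code):

```java
import java.net.URI;

class RobotsUrlSketch {
    // Hypothetical helper: derive the robots.txt URL for the server
    // that hosts the given document URL.
    static String robotsUrlFor(String documentUrl) {
        URI uri = URI.create(documentUrl);
        int port = uri.getPort(); // -1 when the URL has no explicit port
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getHost());
        if (port != -1) {
            // Only attach a port that was actually present in the URL.
            sb.append(':').append(port);
        }
        return sb.append("/robots.txt").toString();
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        System.out.println(URI.create(feed).getPort());
        System.out.println(robotsUrlFor(feed));
    }
}
```

Leaving the port out entirely when it is -1 lets the HTTP client fall
back to the scheme's default, which is the behavior the simple history
above shows for the working run.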

Karl

On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <daddywri@gmail.com> wrote:
> Odd. The problem is obviously the port of -1. But the code does not
> attach a specific port to the URL in that case.
>
> I will try your example exactly when I have access to internet again.
>
> Karl
>
> Sent from my Windows Phone
> From: Ahmet Arslan
> Sent: 11/17/2012 4:47 PM
> To: dev@manifoldcf.apache.org
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> Hi,
>
> Regarding  "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> Pre-ingest service interruption reported for job 1353185325276
> connection 'rss': Couldn't fetch robots.txt from
> http://www.milliyet.com.tr:-1"
>
> I see that http://www.milliyet.com.tr/robots.txt exists.
>
> Ahmet
>
> --- On Sat, 11/17/12, Ahmet Arslan <iorixxx@yahoo.com> wrote:
>
>> From: Ahmet Arslan <iorixxx@yahoo.com>
>> Subject: Re: Anyone out there using RSS connector, who wants to help?
>> To: dev@manifoldcf.apache.org
>> Date: Saturday, November 17, 2012, 11:11 PM
>> Hi Karl,
>>
>> I have never used the RSS connector, but here is what I did.
>>
>> I defined a job to crawl using mcf-trunk. mcf-trunk crawled
>> the following two URLs:
>>
>> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>
>> With the CONNECTORS-120 branch I can crawl
>>
>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>
>> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
>> a status of "Error: Repeated service interruptions - failure
>> getting document version"
>>
>> I see these in the log file :
>>
>>  WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
>> Pre-ingest service interruption reported for job
>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>> from http://www.milliyet.com.tr:-1
>> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') -
>> Exception tossed: Repeated service interruptions - failure
>> getting document version
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> Repeated service interruptions - failure getting document
>> version
>>     at
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>  WARN 2012-11-17 23:02:27,307 (Worker thread '30') -
>> Pre-ingest service interruption reported for job
>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>> from http://www.milliyet.com.tr:-1
>> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') -
>> Exception tossed: Repeated service interruptions - failure
>> getting document version
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> Repeated service interruptions - failure getting document
>> version
>>     at
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>
>>
>> By the way, in the "Dechromed Content" tab (job settings UI) I see
>> four "&nbsp;".
>>
>> Thanks,
>> Ahmet
>> --- On Fri, 11/16/12, Karl Wright <daddywri@gmail.com>
>> wrote:
>>
>> > From: Karl Wright <daddywri@gmail.com>
>> > Subject: Anyone out there using RSS connector, who
>> wants to help?
>> > To: "dev" <dev@manifoldcf.apache.org>
>> > Date: Friday, November 16, 2012, 3:54 PM
>> > Hi all,
>> >
>> > The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
>> > contains an RSS connector that has been updated to use
>> > httpcomponents
>> > 4.2.2.  I'd love for people who are in a position to
>> do
>> > significant
>> > RSS crawling to try it out before I pull it into
>> > trunk.  Any takers?
>> >
>> > Karl
>> >
>>
