manifoldcf-dev mailing list archives

From Ahmet Arslan <iori...@yahoo.com>
Subject Re: Anyone out there using RSS connector, who wants to help?
Date Sun, 18 Nov 2012 23:17:00 GMT
Hi Karl,

I re-ran experiments with r1411016 and both RSS URLs are working now with CONNECTORS-120.

Regarding robots.txt, http://rss.hurriyet.com.tr/robots.txt does not exist, but http://www.hurriyet.com.tr/robots.txt
does.
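
For illustration only, here is a minimal sketch of how a per-host robots.txt URL could be derived without ever producing a ":-1" suffix like the one in the earlier warning. It is not the connector's actual code: the class RobotsUrlSketch and the method robotsUrlFor are hypothetical names. It relies only on the fact that java.net.URI.getPort() returns -1 when a URL carries no explicit port.

```java
import java.net.URI;

public class RobotsUrlSketch {
    // Hypothetical helper (not part of ManifoldCF): builds the robots.txt URL
    // for the host that serves the given document URL.
    public static String robotsUrlFor(String documentUrl) {
        URI uri = URI.create(documentUrl);
        int port = uri.getPort(); // -1 means "no explicit port in the URL"
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getHost());
        if (port != -1) {
            // Only append a port when the URL actually specified one.
            sb.append(':').append(port);
        }
        sb.append("/robots.txt");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(robotsUrlFor("http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml"));
        System.out.println(robotsUrlFor("http://rss.hurriyet.com.tr/rss.aspx?sectionId=2"));
    }
}
```

Each host gets its own robots.txt URL, which is why rss.hurriyet.com.tr and www.hurriyet.com.tr can give different results.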

Ahmet

--- On Sun, 11/18/12, Karl Wright <daddywri@gmail.com> wrote:

> From: Karl Wright <daddywri@gmail.com>
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> To: "Ahmet Arslan" <iorixxx@yahoo.com>, "dev@manifoldcf.apache.org" <dev@manifoldcf.apache.org>
> Date: Sunday, November 18, 2012, 8:04 PM
> Hi Ahmet,
> 
> I tried your example, but it looked like it worked fine here.  Here's
> part of the simple history:
> 
> >>>>>>
> 11-18-2012 12:59:52.182  document ingest (null)  http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gundem/gundemdetay/18.11.2012/1628733/default.htm  OK  16307  1
> 11-18-2012 12:59:47.482  document ingest (null)  http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/gundemdetay/18.11.2012/1628657/default.htm  OK  10573  1
> 11-18-2012 12:59:47.133  fetch  http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gundem/gundemdetay/18.11.2012/1628733/default.htm  200  16307  5050
> 11-18-2012 12:59:42.133  fetch  http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/gundemdetay/18.11.2012/1628657/default.htm  200  10573  5340
> 11-18-2012 12:59:42.092  document ingest (null)  http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyilesme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm  OK  10212  1
> 11-18-2012 12:59:37.252  document ingest (null)  http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasinin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm  OK  16105  1
> 11-18-2012 12:59:37.133  fetch  http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyilesme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm  200  10212  4950
> 11-18-2012 12:59:32.332  document ingest (null)  http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gundem/gundemdetay/18.11.2012/1628801/default.htm  OK  10170  1
> 11-18-2012 12:59:32.133  fetch  http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasinin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm  200  16105  5110
> 11-18-2012 12:59:27.142  document ingest (null)  http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gundemdetay/18.11.2012/1628661/default.htm  OK  10102  1
> 11-18-2012 12:59:27.133  fetch  http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gundem/gundemdetay/18.11.2012/1628801/default.htm  200  10170  5200
> 11-18-2012 12:59:22.182  document ingest (null)  http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/gundemdetay/18.11.2012/1628824/default.htm  OK  10217  1
> 11-18-2012 12:59:22.133  fetch  http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gundemdetay/18.11.2012/1628661/default.htm  200  10102  4990
> 11-18-2012 12:59:18.062  document ingest (null)  http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem/gundemdetay/18.11.2012/1628856/default.htm  OK  9721  1
> 11-18-2012 12:59:17.133  fetch  http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/gundemdetay/18.11.2012/1628824/default.htm  200  10217  5050
> 11-18-2012 12:59:12.452  document ingest (null)  http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-reddetti/gundem/gundemdetay/18.11.2012/1628795/default.htm  OK  11412  1
> 11-18-2012 12:59:12.133  fetch  http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem/gundemdetay/18.11.2012/1628856/default.htm  200  9721  5930
> 11-18-2012 12:59:07.133  fetch  http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-reddetti/gundem/gundemdetay/18.11.2012/1628795/default.htm  200  11412  5300
> 11-18-2012 12:59:06.892  document ingest (null)  http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/gundemdetay/17.11.2012/1628402/default.htm  OK  11183  1
> 11-18-2012 12:59:02.772  document ingest (null)  http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi/gundem/gundemdetay/18.11.2012/1628740/default.htm  OK  10632  1
> 11-18-2012 12:59:02.153  fetch  http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/gundemdetay/17.11.2012/1628402/default.htm  200  11183  4720
> 11-18-2012 12:58:57.173  fetch  http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi/gundem/gundemdetay/18.11.2012/1628740/default.htm  200  10632  5570
> 11-18-2012 12:58:52.533  robots parse  www.hurriyet.com.tr  SUCCESS  0  78
> 11-18-2012 12:58:52.511  robots parse  gundem.milliyet.com.tr  SUCCESS  0  70
> 11-18-2012 12:58:52.136  fetch  http://www.hurriyet.com.tr/robots.txt  200  928  476
> 11-18-2012 12:58:52.129  fetch  http://gundem.milliyet.com.tr/robots.txt  200  797  453
> 11-18-2012 12:58:49.013  fetch  http://rss.hurriyet.com.tr/rss.aspx?sectionId=2  200  34467  1080
> 11-18-2012 12:58:48.993  fetch  http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml  200  72439  1510
> 11-18-2012 12:58:44.513  robots parse  www.milliyet.com.tr  SUCCESS  0  340
> 11-18-2012 12:58:44.013  fetch  http://rss.hurriyet.com.tr/robots.txt  404  4096  770
> 11-18-2012 12:58:44.013  fetch  http://www.milliyet.com.tr/robots.txt  200  17484  840
> 11-18-2012 12:58:41.502  job start  1353261469661(rss)    0  1
> <<<<<<
> 
> So it looks like there's a http://www.milliyet.com.tr/robots.txt that
> it fetched fine, and there is no http://rss.hurriyet.com.tr/robots.txt.
> Does this seem correct to you?
> Furthermore, there is content that the feed points at that requires
> access to (and robots fetches for) two other servers...
> 
> Karl
> 
> On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <daddywri@gmail.com> wrote:
> > Odd. The problem is obviously the port of -1. But the code does not
> > attach a specific port to the URL in that case.
> >
> > I will try your example exactly when I have access to internet again.
> >
> > Karl
> >
> > Sent from my Windows Phone
> > From: Ahmet Arslan
> > Sent: 11/17/2012 4:47 PM
> > To: dev@manifoldcf.apache.org
> > Subject: Re: Anyone out there using RSS connector, who wants to help?
> > Hi,
> >
> > Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> > Pre-ingest service interruption reported for job 1353185325276
> > connection 'rss': Couldn't fetch robots.txt from
> > http://www.milliyet.com.tr:-1"
> >
> > I see that http://www.milliyet.com.tr/robots.txt exists.
> >
> > Ahmet
> >
> > --- On Sat, 11/17/12, Ahmet Arslan <iorixxx@yahoo.com> wrote:
> >
> >> From: Ahmet Arslan <iorixxx@yahoo.com>
> >> Subject: Re: Anyone out there using RSS connector, who wants to help?
> >> To: dev@manifoldcf.apache.org
> >> Date: Saturday, November 17, 2012, 11:11 PM
> >> Hi Karl,
> >>
> >> I have never used the RSS connector, but here is what I have done.
> >>
> >> I defined a job to crawl using mcf-trunk. mcf-trunk crawled the
> >> following two URLs:
> >>
> >> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> >>
> >> With CONNECTORS-120 branch I can crawl
> >>
> >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> >>
> >> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives a
> >> status of "Error: Repeated service interruptions - failure
> >> getting document version"
> >>
> >> I see these in the log file:
> >>
> >>  WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> >> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
> >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
> >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> >>  WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> >> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
> >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
> >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> >>
> >>
> >> By the way, in the "Dechromed Content" tab (Job Setting UI) I see
> >> four "&nbsp;"
> >>
> >> Thanks,
> >> Ahmet
> >> --- On Fri, 11/16/12, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> > From: Karl Wright <daddywri@gmail.com>
> >> > Subject: Anyone out there using RSS connector, who wants to help?
> >> > To: "dev" <dev@manifoldcf.apache.org>
> >> > Date: Friday, November 16, 2012, 3:54 PM
> >> > Hi all,
> >> >
> >> > The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
> >> > contains an RSS connector that has been updated to use httpcomponents
> >> > 4.2.2.  I'd love for people who are in a position to do significant
> >> > RSS crawling to try it out before I pull it into trunk.  Any takers?
> >> >
> >> > Karl
> >> >
> >>
> 
