manifoldcf-dev mailing list archives

From Maciej Liżewski <maciej.lizew...@gmail.com>
Subject Re: Anyone out there using RSS connector, who wants to help?
Date Tue, 20 Nov 2012 12:09:20 GMT
As far as I can see, CONNECTORS-120 has already been merged to trunk. I tested
the wiki connector in my environment and it works correctly.


2012/11/19 Ahmet Arslan <iorixxx@yahoo.com>

> Hi Karl,
>
> I re-ran experiments with r1411016 and both RSS URLs are working now with
> CONNECTORS-120.
>
> Regarding robots.txt, http://rss.hurriyet.com.tr/robots.txt does not
> exist, but http://www.hurriyet.com.tr/robots.txt does.
>
> Ahmet
>
> --- On Sun, 11/18/12, Karl Wright <daddywri@gmail.com> wrote:
>
> > From: Karl Wright <daddywri@gmail.com>
> > Subject: Re: Anyone out there using RSS connector, who wants to help?
> > To: "Ahmet Arslan" <iorixxx@yahoo.com>, "dev@manifoldcf.apache.org" <dev@manifoldcf.apache.org>
> > Date: Sunday, November 18, 2012, 8:04 PM
> > Hi Ahmet,
> >
> > I tried your example, but it looked like it worked fine here.  Here's
> > part of the simple history:
> >
> > >>>>>>
> > 11-18-2012 12:59:52.182     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
> > ndem/gundemdetay/18.11.2012/1628733/default.htm
> >     OK     16307
> >     1
> > 11-18-2012 12:59:47.482     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
> > gundemdetay/18.11.2012/1628657/default.htm
> >     OK     10573
> >     1
> > 11-18-2012 12:59:47.133     fetch
> >     http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
> > ndem/gundemdetay/18.11.2012/1628733/default.htm
> >     200     16307
> >     5050
> > 11-18-2012 12:59:42.133     fetch
> >     http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
> > gundemdetay/18.11.2012/1628657/default.htm
> >     200     10573
> >     5340
> > 11-18-2012 12:59:42.092     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
> > sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
> >     OK     10212
> >     1
> > 11-18-2012 12:59:37.252     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
> > nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
> >     OK     16105
> >     1
> > 11-18-2012 12:59:37.133     fetch
> >     http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
> > sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
> >     200     10212
> >     4950
> > 11-18-2012 12:59:32.332     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
> > m/gundemdetay/18.11.2012/1628801/default.htm
> >     OK     10170
> >     1
> > 11-18-2012 12:59:32.133     fetch
> >     http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
> > nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
> >     200     16105
> >     5110
> > 11-18-2012 12:59:27.142     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
> > ndemdetay/18.11.2012/1628661/default.htm
> >     OK     10102
> >     1
> > 11-18-2012 12:59:27.133     fetch
> >     http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
> > m/gundemdetay/18.11.2012/1628801/default.htm
> >     200     10170
> >     5200
> > 11-18-2012 12:59:22.182     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
> > gundemdetay/18.11.2012/1628824/default.htm
> >     OK     10217
> >     1
> > 11-18-2012 12:59:22.133     fetch
> >     http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
> > ndemdetay/18.11.2012/1628661/default.htm
> >     200     10102
> >     4990
> > 11-18-2012 12:59:18.062     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
> > /gundemdetay/18.11.2012/1628856/default.htm
> >     OK     9721
> >     1
> > 11-18-2012 12:59:17.133     fetch
> >     http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
> > gundemdetay/18.11.2012/1628824/default.htm
> >     200     10217
> >     5050
> > 11-18-2012 12:59:12.452     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
> > etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
> >     OK     11412
> >     1
> > 11-18-2012 12:59:12.133     fetch
> >     http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
> > /gundemdetay/18.11.2012/1628856/default.htm
> >     200     9721
> >     5930
> > 11-18-2012 12:59:07.133     fetch
> >     http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
> > etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
> >     200     11412
> >     5300
> > 11-18-2012 12:59:06.892     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
> > gundemdetay/17.11.2012/1628402/default.htm
> >     OK     11183
> >     1
> > 11-18-2012 12:59:02.772     document ingest
> > (null)
> >     http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
> > /gundem/gundemdetay/18.11.2012/1628740/default.htm
> >     OK     10632
> >     1
> > 11-18-2012 12:59:02.153     fetch
> >     http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
> > gundemdetay/17.11.2012/1628402/default.htm
> >     200     11183
> >     4720
> > 11-18-2012 12:58:57.173     fetch
> >     http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
> > /gundem/gundemdetay/18.11.2012/1628740/default.htm
> >     200     10632
> >     5570
> > 11-18-2012 12:58:52.533     robots parse
> >     www.hurriyet.com.tr
> >     SUCCESS     0
> >     78
> > 11-18-2012 12:58:52.511     robots parse
> >     gundem.milliyet.com.tr
> >     SUCCESS     0
> >     70
> > 11-18-2012 12:58:52.136     fetch
> >     http://www.hurriyet.com.tr/robots.txt
> >     200     928
> >     476
> > 11-18-2012 12:58:52.129     fetch
> >     http://gundem.milliyet.com.tr/robots.txt
> >     200     797
> >     453
> > 11-18-2012 12:58:49.013     fetch
> >     http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> >     200     34467
> >     1080
> > 11-18-2012 12:58:48.993     fetch
> >     http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> >     200     72439
> >     1510
> > 11-18-2012 12:58:44.513     robots parse
> >     www.milliyet.com.tr
> >     SUCCESS     0
> >     340
> > 11-18-2012 12:58:44.013     fetch
> >     http://rss.hurriyet.com.tr/robots.txt
> >     404     4096
> >     770
> > 11-18-2012 12:58:44.013     fetch
> >     http://www.milliyet.com.tr/robots.txt
> >     200     17484
> >     840
> > 11-18-2012 12:58:41.502     job start
> >     1353261469661(rss)
> >         0     1
> >
> > <<<<<<
> >
> > So it looks like there's a http://www.milliyet.com.tr/robots.txt that
> > it fetched fine, and there is no http://rss.hurriyet.com.tr/robots.txt.
> > Does this seem correct to you?  Furthermore, there is content that the
> > feed points at that requires access to (and robots fetches for) two
> > other servers...
> >
> > Karl
> >
> > On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <daddywri@gmail.com>
> > wrote:
> > > Odd. The problem is obviously the port of -1. But the code does not
> > > attach a specific port to the URL in that case.
> > >
> > > I will try your example exactly when I have access to internet again.
> > >
> > > Karl
> > >
> > > Sent from my Windows Phone
> > > From: Ahmet Arslan
> > > Sent: 11/17/2012 4:47 PM
> > > To: dev@manifoldcf.apache.org
> > > Subject: Re: Anyone out there using RSS connector, who wants to help?
> > > Hi,
> > >
> > > Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> > > Pre-ingest service interruption reported for job 1353185325276
> > > connection 'rss': Couldn't fetch robots.txt from
> > > http://www.milliyet.com.tr:-1"
> > >
> > > I see that http://www.milliyet.com.tr/robots.txt exists.
> > >
> > > Ahmet
> > >
> > > --- On Sat, 11/17/12, Ahmet Arslan <iorixxx@yahoo.com> wrote:
> > >
> > >> From: Ahmet Arslan <iorixxx@yahoo.com>
> > >> Subject: Re: Anyone out there using RSS connector, who wants to help?
> > >> To: dev@manifoldcf.apache.org
> > >> Date: Saturday, November 17, 2012, 11:11 PM
> > >> Hi Karl,
> > >>
> > >> I have never used the RSS connector, but here is what I did.
> > >>
> > >> I defined a job to crawl using mcf-trunk. mcf-trunk crawled the
> > >> following two URLs:
> > >>
> > >> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> > >>
> > >> With CONNECTORS-120 branch I can crawl
> > >>
> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> > >>
> > >> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives a status of
> > >> "Error: Repeated service interruptions - failure getting document
> > >> version"
> > >>
> > >> I see these in the log file:
> > >>
> > >>  WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> > >> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
> > >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
> > >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> > >>  WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> > >> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
> > >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
> > >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> > >>
> > >>
> > >> By the way, in the "Dechromed Content" tab (Job Setting UI) I see
> > >> four " "
> > >>
> > >> Thanks,
> > >> Ahmet
> > >> --- On Fri, 11/16/12, Karl Wright <daddywri@gmail.com>
> > >> wrote:
> > >>
> > >> > From: Karl Wright <daddywri@gmail.com>
> > >> > Subject: Anyone out there using RSS connector, who wants to help?
> > >> > To: "dev" <dev@manifoldcf.apache.org>
> > >> > Date: Friday, November 16, 2012, 3:54 PM
> > >> > Hi all,
> > >> >
> > >> > The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
> > >> > contains an RSS connector that has been updated to use httpcomponents
> > >> > 4.2.2.  I'd love for people who are in a position to do significant
> > >> > RSS crawling to try it out before I pull it into trunk.  Any takers?
> > >> >
> > >> > Karl
> > >> >
> > >>
> >
>
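
A note on the "port of -1" in the warning quoted above: in Java, java.net.URI.getPort() (like java.net.URL.getPort()) returns -1 when a URL carries no explicit port, so a string such as "http://www.milliyet.com.tr:-1" can arise whenever that value is concatenated without a check. The thread does not establish that this is what happened in the CONNECTORS-120 branch, so the following is only a minimal sketch of the general pitfall, not the connector's code. The class and method names (RobotsUrlSketch, naiveBase, robotsUrlFor) are hypothetical. The sketch also shows why robots.txt is looked up separately for each host, as with rss.hurriyet.com.tr and www.hurriyet.com.tr in the history above.

```java
import java.net.URI;

// Hypothetical illustration; not taken from the ManifoldCF source.
class RobotsUrlSketch {

    // Naive concatenation: appends whatever getPort() returns, including -1.
    static String naiveBase(String documentUrl) {
        URI uri = URI.create(documentUrl);
        return uri.getScheme() + "://" + uri.getHost() + ":" + uri.getPort();
    }

    // Builds the robots.txt URL for the host that serves documentUrl,
    // adding a port only when the URL names one explicitly.
    static String robotsUrlFor(String documentUrl) {
        URI uri = URI.create(documentUrl);
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getHost());
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        return sb.append("/robots.txt").toString();
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        System.out.println(naiveBase(feed));    // http://www.milliyet.com.tr:-1
        System.out.println(robotsUrlFor(feed)); // http://www.milliyet.com.tr/robots.txt
        // A different host means a different robots.txt location.
        System.out.println(robotsUrlFor("http://rss.hurriyet.com.tr/rss.aspx?sectionId=2"));
    }
}
```

Because the robots.txt location is derived from the scheme, host and port of each document URL, a feed on one host that links to articles on other hosts leads to one robots.txt fetch per host.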
