manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Anyone out there using RSS connector, who wants to help?
Date Tue, 20 Nov 2012 12:11:31 GMT
Thanks for the update!

I'm working on the web connector now.  That's going to require a bit more work.

Karl

On Tue, Nov 20, 2012 at 7:09 AM, Maciej Liżewski
<maciej.lizewski@gmail.com> wrote:
> I see that CONNECTORS-120 is already merged to trunk. I tested the wiki
> connector in my environment and it works correctly.
>
>
> 2012/11/19 Ahmet Arslan <iorixxx@yahoo.com>
>
>> Hi Karl,
>>
>> I re-ran experiments with r1411016 and both RSS URLs are working now with
>> CONNECTORS-120.
>>
>> Regarding robots.txt, http://rss.hurriyet.com.tr/robots.txt does not
>> exist, but http://www.hurriyet.com.tr/robots.txt does.
>>
>> Ahmet
>>
>> --- On Sun, 11/18/12, Karl Wright <daddywri@gmail.com> wrote:
>>
>> > From: Karl Wright <daddywri@gmail.com>
>> > Subject: Re: Anyone out there using RSS connector, who wants to help?
>> > To: "Ahmet Arslan" <iorixxx@yahoo.com>, "dev@manifoldcf.apache.org" <
>> dev@manifoldcf.apache.org>
>> > Date: Sunday, November 18, 2012, 8:04 PM
>> > Hi Ahmet,
>> >
>> > I tried your example, but it looked like it worked fine here.  Here's
>> > part of the simple history:
>> >
>> > >>>>>>
>> > 11-18-2012 12:59:52.182     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
>> > ndem/gundemdetay/18.11.2012/1628733/default.htm
>> >     OK     16307
>> >     1
>> > 11-18-2012 12:59:47.482     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
>> > gundemdetay/18.11.2012/1628657/default.htm
>> >     OK     10573
>> >     1
>> > 11-18-2012 12:59:47.133     fetch
>> >     http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
>> > ndem/gundemdetay/18.11.2012/1628733/default.htm
>> >     200     16307
>> >     5050
>> > 11-18-2012 12:59:42.133     fetch
>> >     http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
>> > gundemdetay/18.11.2012/1628657/default.htm
>> >     200     10573
>> >     5340
>> > 11-18-2012 12:59:42.092     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
>> > sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
>> >     OK     10212
>> >     1
>> > 11-18-2012 12:59:37.252     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
>> > nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
>> >     OK     16105
>> >     1
>> > 11-18-2012 12:59:37.133     fetch
>> >     http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
>> > sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
>> >     200     10212
>> >     4950
>> > 11-18-2012 12:59:32.332     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
>> > m/gundemdetay/18.11.2012/1628801/default.htm
>> >     OK     10170
>> >     1
>> > 11-18-2012 12:59:32.133     fetch
>> >     http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
>> > nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
>> >     200     16105
>> >     5110
>> > 11-18-2012 12:59:27.142     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
>> > ndemdetay/18.11.2012/1628661/default.htm
>> >     OK     10102
>> >     1
>> > 11-18-2012 12:59:27.133     fetch
>> >     http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
>> > m/gundemdetay/18.11.2012/1628801/default.htm
>> >     200     10170
>> >     5200
>> > 11-18-2012 12:59:22.182     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
>> > gundemdetay/18.11.2012/1628824/default.htm
>> >     OK     10217
>> >     1
>> > 11-18-2012 12:59:22.133     fetch
>> >     http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
>> > ndemdetay/18.11.2012/1628661/default.htm
>> >     200     10102
>> >     4990
>> > 11-18-2012 12:59:18.062     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
>> > /gundemdetay/18.11.2012/1628856/default.htm
>> >     OK     9721
>> >     1
>> > 11-18-2012 12:59:17.133     fetch
>> >     http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
>> > gundemdetay/18.11.2012/1628824/default.htm
>> >     200     10217
>> >     5050
>> > 11-18-2012 12:59:12.452     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
>> > etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
>> >     OK     11412
>> >     1
>> > 11-18-2012 12:59:12.133     fetch
>> >     http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
>> > /gundemdetay/18.11.2012/1628856/default.htm
>> >     200     9721
>> >     5930
>> > 11-18-2012 12:59:07.133     fetch
>> >     http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
>> > etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
>> >     200     11412
>> >     5300
>> > 11-18-2012 12:59:06.892     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
>> > gundemdetay/17.11.2012/1628402/default.htm
>> >     OK     11183
>> >     1
>> > 11-18-2012 12:59:02.772     document ingest
>> > (null)
>> >     http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
>> > /gundem/gundemdetay/18.11.2012/1628740/default.htm
>> >     OK     10632
>> >     1
>> > 11-18-2012 12:59:02.153     fetch
>> >     http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
>> > gundemdetay/17.11.2012/1628402/default.htm
>> >     200     11183
>> >     4720
>> > 11-18-2012 12:58:57.173     fetch
>> >     http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
>> > /gundem/gundemdetay/18.11.2012/1628740/default.htm
>> >     200     10632
>> >     5570
>> > 11-18-2012 12:58:52.533     robots parse
>> >     www.hurriyet.com.tr
>> >     SUCCESS     0
>> >     78
>> > 11-18-2012 12:58:52.511     robots parse
>> >     gundem.milliyet.com.tr
>> >     SUCCESS     0
>> >     70
>> > 11-18-2012 12:58:52.136     fetch
>> >     http://www.hurriyet.com.tr/robots.txt
>> >     200     928
>> >     476
>> > 11-18-2012 12:58:52.129     fetch
>> >     http://gundem.milliyet.com.tr/robots.txt
>> >     200     797
>> >     453
>> > 11-18-2012 12:58:49.013     fetch
>> >     http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>> >     200     34467
>> >     1080
>> > 11-18-2012 12:58:48.993     fetch
>> >     http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>> >     200     72439
>> >     1510
>> > 11-18-2012 12:58:44.513     robots parse
>> >     www.milliyet.com.tr
>> >     SUCCESS     0
>> >     340
>> > 11-18-2012 12:58:44.013     fetch
>> >     http://rss.hurriyet.com.tr/robots.txt
>> >     404     4096
>> >     770
>> > 11-18-2012 12:58:44.013     fetch
>> >     http://www.milliyet.com.tr/robots.txt
>> >     200     17484
>> >     840
>> > 11-18-2012 12:58:41.502     job start
>> >     1353261469661(rss)
>> >         0     1
>> >
>> > <<<<<<
>> >
>> > So it looks like there's a http://www.milliyet.com.tr/robots.txt that
>> > it fetched fine, and there is no http://rss.hurriyet.com.tr/robots.txt.
>> > Does this seem correct to you?  Furthermore, the feed points at content
>> > that requires access to (and robots fetches for) two other servers...
>> >
>> > Karl
>> >
>> > On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <daddywri@gmail.com>
>> > wrote:
>> > > Odd. The problem is obviously the port of -1. But the code does not
>> > > attach a specific port to the URL in that case.
>> > >
>> > > I will try your example exactly when I have access to internet again.
>> > >
>> > > Karl
>> > >
>> > > Sent from my Windows Phone
>> > > From: Ahmet Arslan
>> > > Sent: 11/17/2012 4:47 PM
>> > > To: dev@manifoldcf.apache.org
>> > > Subject: Re: Anyone out there using RSS connector, who
>> > wants to help?
>> > > Hi,
>> > >
>> > > Regarding  "WARN 2012-11-17 23:01:17,649 (Worker
>> > thread '31') -
>> > > Pre-ingest service interruption reported for job
>> > 1353185325276
>> > > connection 'rss': Couldn't fetch robots.txt from
>> > > http://www.milliyet.com.tr:-1"
>> > >
>> > > I see that http://www.milliyet.com.tr/robots.txt exists.
>> > >
>> > > Ahmet
>> > >
>> > > --- On Sat, 11/17/12, Ahmet Arslan <iorixxx@yahoo.com>
>> > wrote:
>> > >
>> > >> From: Ahmet Arslan <iorixxx@yahoo.com>
>> > >> Subject: Re: Anyone out there using RSS connector,
>> > who wants to help?
>> > >> To: dev@manifoldcf.apache.org
>> > >> Date: Saturday, November 17, 2012, 11:11 PM
>> > >> Hi Karl,
>> > >>
>> > >> I have never used the RSS connector, but here is what I did.
>> > >>
>> > >> I defined a job to crawl using mcf-trunk. mcf-trunk crawled the
>> > >> following two URLs:
>> > >>
>> > >> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>> > >>
>> > >> With CONNECTORS-120 branch I can crawl
>> > >>
>> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>> > >>
>> > >> but  http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
>> > >> status of "Error: Repeated service interruptions -
>> > failure
>> > >> getting document version"
>> > >>
>> > >> I see these in the log file :
>> > >>
>> > >>  WARN 2012-11-17 23:01:17,649 (Worker thread
>> > '31') -
>> > >> Pre-ingest service interruption reported for job
>> > >> 1353185325276 connection 'rss': Couldn't fetch
>> > robots.txt
>> > >> from http://www.milliyet.com.tr:-1
>> > >> ERROR 2012-11-17 23:01:17,802 (Worker thread '31')
>> > -
>> > >> Exception tossed: Repeated service interruptions -
>> > failure
>> > >> getting document version
>> > >>
>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> > >> Repeated service interruptions - failure getting
>> > document
>> > >> version
>> > >>     at
>> > >>
>> >
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>> > >>  WARN 2012-11-17 23:02:27,307 (Worker thread
>> > '30') -
>> > >> Pre-ingest service interruption reported for job
>> > >> 1353185325276 connection 'rss': Couldn't fetch
>> > robots.txt
>> > >> from http://www.milliyet.com.tr:-1
>> > >> ERROR 2012-11-17 23:02:27,329 (Worker thread '30')
>> > -
>> > >> Exception tossed: Repeated service interruptions -
>> > failure
>> > >> getting document version
>> > >>
>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> > >> Repeated service interruptions - failure getting
>> > document
>> > >> version
>> > >>     at
>> > >>
>> >
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>> > >>
>> > >>
>> > >> By the way in "Dechromed Content" tab (Job Setting
>> > UI) I see
>> > >> four " "
>> > >>
>> > >> Thanks,
>> > >> Ahmet
>> > >> --- On Fri, 11/16/12, Karl Wright <daddywri@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > From: Karl Wright <daddywri@gmail.com>
>> > >> > Subject: Anyone out there using RSS connector,
>> > who
>> > >> wants to help?
>> > >> > To: "dev" <dev@manifoldcf.apache.org>
>> > >> > Date: Friday, November 16, 2012, 3:54 PM
>> > >> > Hi all,
>> > >> >
>> > >> > The branch
>> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
>> > >> > contains an RSS connector that has been
>> > updated to use
>> > >> > httpcomponents
>> > >> > 4.2.2.  I'd love for people who are in a
>> > position to
>> > >> do
>> > >> > significant
>> > >> > RSS crawling to try it out before I pull it
>> > into
>> > >> > trunk.  Any takers?
>> > >> >
>> > >> > Karl
>> > >> >
>> > >>
>> >
>>
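
A note on the ":-1" in the warning quoted above. In Java,
java.net.URI.getPort() returns -1 when a URL names no explicit port, so a
robots.txt URL built by appending the port unconditionally would come out
looking like the one in the log. Whether that is what actually happened
in the CONNECTORS-120 branch is not established in this thread. The
sketch below only illustrates the pitfall and one way to avoid it. It is
not ManifoldCF code, and the class and method names (RobotsUrlSketch,
robotsUrl) are made up for this example.

```java
import java.net.URI;

public class RobotsUrlSketch {

    /**
     * Builds the robots.txt URL for the server that hosts documentUrl.
     * Hypothetical helper for illustration only; not part of ManifoldCF.
     */
    static String robotsUrl(String documentUrl) {
        URI uri = URI.create(documentUrl);
        int port = uri.getPort(); // -1 when the URL carries no explicit port

        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getHost());
        // Only append the port when one was actually given; appending it
        // blindly would yield "http://host:-1/robots.txt".
        if (port != -1) {
            sb.append(':').append(port);
        }
        return sb.append("/robots.txt").toString();
    }

    public static void main(String[] args) {
        // The two feed URLs from this thread.
        System.out.println(robotsUrl("http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml"));
        System.out.println(robotsUrl("http://rss.hurriyet.com.tr/rss.aspx?sectionId=2"));
    }
}
```

Each host gets its own robots.txt URL, which matches the simple history
above: rss.hurriyet.com.tr and www.hurriyet.com.tr are fetched separately.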
