manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Anyone out there using RSS connector, who wants to help?
Date Sat, 24 Nov 2012 12:04:37 GMT
I have just reworked the SharePoint connector in
branches/CONNECTORS-120 somewhat, to stream rather than copy through
an intermediate file.  While I don't expect any change in behavior, it
would be good to confirm I didn't do anything stupid, so another
sample crawl would be very welcome.

Thanks!
Karl
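
For readers unfamiliar with the distinction, here is a minimal Java sketch of what "stream rather than copy through an intermediate file" can look like. It is only an illustration under assumptions: the class and method names (`StreamingSketch`, `stream`) and the 8 KB buffer size are hypothetical and are not taken from the CONNECTORS-120 branch or the SharePoint connector.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class StreamingSketch {
    // Hypothetical helper: moves bytes from source to sink in fixed-size
    // chunks, so the document is never written to a temporary file and
    // never has to fit in memory all at once. Returns the byte count.
    static long stream(InputStream source, OutputStream sink) throws IOException {
        byte[] buffer = new byte[8192]; // assumed chunk size
        long total = 0;
        int n;
        while ((n = source.read(buffer)) != -1) {
            sink.write(buffer, 0, n);
            total += n;
        }
        sink.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for the repository response and the
        // downstream consumer.
        byte[] doc = "sample document body".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = stream(new ByteArrayInputStream(doc), sink);
        System.out.println(copied + " bytes streamed");
    }
}
```

The trade-off is that a streamed response cannot be re-read after a failure without fetching it again, which is one reason a confirming sample crawl is useful after this kind of change.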

On Tue, Nov 20, 2012 at 7:11 AM, Karl Wright <daddywri@gmail.com> wrote:
> Thanks for the update!
>
> I'm working on the web connector now.  That's going to require a bit more work.
>
> Karl
>
> On Tue, Nov 20, 2012 at 7:09 AM, Maciej Liżewski
> <maciej.lizewski@gmail.com> wrote:
>> CONNECTORS-120 is already merged to trunk, as far as I can see. I tested
>> the wiki connector in my environment and it works correctly.
>>
>>
>> 2012/11/19 Ahmet Arslan <iorixxx@yahoo.com>
>>
>>> Hi Karl,
>>>
>>> I re-ran experiments with r1411016 and both RSS URLs are working now with
>>> CONNECTORS-120.
>>>
>>> Regarding robots.txt, http://rss.hurriyet.com.tr/robots.txt does not
>>> exist, but http://www.hurriyet.com.tr/robots.txt does.
>>>
>>> Ahmet
>>>
>>> --- On Sun, 11/18/12, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>> > From: Karl Wright <daddywri@gmail.com>
>>> > Subject: Re: Anyone out there using RSS connector, who wants to help?
>>> > To: "Ahmet Arslan" <iorixxx@yahoo.com>, "dev@manifoldcf.apache.org" <dev@manifoldcf.apache.org>
>>> > Date: Sunday, November 18, 2012, 8:04 PM
>>> > Hi Ahmet,
>>> >
>>> > I tried your example, but it looked like it worked fine
>>> > here.  Here's
>>> > part of the simple history:
>>> >
>>> > >>>>>>
>>> > 11-18-2012 12:59:52.182     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
>>> > ndem/gundemdetay/18.11.2012/1628733/default.htm
>>> >     OK     16307
>>> >     1
>>> > 11-18-2012 12:59:47.482     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
>>> > gundemdetay/18.11.2012/1628657/default.htm
>>> >     OK     10573
>>> >     1
>>> > 11-18-2012 12:59:47.133     fetch
>>> >     http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
>>> > ndem/gundemdetay/18.11.2012/1628733/default.htm
>>> >     200     16307
>>> >     5050
>>> > 11-18-2012 12:59:42.133     fetch
>>> >     http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
>>> > gundemdetay/18.11.2012/1628657/default.htm
>>> >     200     10573
>>> >     5340
>>> > 11-18-2012 12:59:42.092     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
>>> > sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
>>> >     OK     10212
>>> >     1
>>> > 11-18-2012 12:59:37.252     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
>>> > nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
>>> >     OK     16105
>>> >     1
>>> > 11-18-2012 12:59:37.133     fetch
>>> >     http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
>>> > sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
>>> >     200     10212
>>> >     4950
>>> > 11-18-2012 12:59:32.332     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
>>> > m/gundemdetay/18.11.2012/1628801/default.htm
>>> >     OK     10170
>>> >     1
>>> > 11-18-2012 12:59:32.133     fetch
>>> >     http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
>>> > nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
>>> >     200     16105
>>> >     5110
>>> > 11-18-2012 12:59:27.142     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
>>> > ndemdetay/18.11.2012/1628661/default.htm
>>> >     OK     10102
>>> >     1
>>> > 11-18-2012 12:59:27.133     fetch
>>> >     http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
>>> > m/gundemdetay/18.11.2012/1628801/default.htm
>>> >     200     10170
>>> >     5200
>>> > 11-18-2012 12:59:22.182     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
>>> > gundemdetay/18.11.2012/1628824/default.htm
>>> >     OK     10217
>>> >     1
>>> > 11-18-2012 12:59:22.133     fetch
>>> >     http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
>>> > ndemdetay/18.11.2012/1628661/default.htm
>>> >     200     10102
>>> >     4990
>>> > 11-18-2012 12:59:18.062     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
>>> > /gundemdetay/18.11.2012/1628856/default.htm
>>> >     OK     9721
>>> >     1
>>> > 11-18-2012 12:59:17.133     fetch
>>> >     http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
>>> > gundemdetay/18.11.2012/1628824/default.htm
>>> >     200     10217
>>> >     5050
>>> > 11-18-2012 12:59:12.452     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
>>> > etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
>>> >     OK     11412
>>> >     1
>>> > 11-18-2012 12:59:12.133     fetch
>>> >     http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
>>> > /gundemdetay/18.11.2012/1628856/default.htm
>>> >     200     9721
>>> >     5930
>>> > 11-18-2012 12:59:07.133     fetch
>>> >     http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
>>> > etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
>>> >     200     11412
>>> >     5300
>>> > 11-18-2012 12:59:06.892     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
>>> > gundemdetay/17.11.2012/1628402/default.htm
>>> >     OK     11183
>>> >     1
>>> > 11-18-2012 12:59:02.772     document ingest
>>> > (null)
>>> >     http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
>>> > /gundem/gundemdetay/18.11.2012/1628740/default.htm
>>> >     OK     10632
>>> >     1
>>> > 11-18-2012 12:59:02.153     fetch
>>> >     http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
>>> > gundemdetay/17.11.2012/1628402/default.htm
>>> >     200     11183
>>> >     4720
>>> > 11-18-2012 12:58:57.173     fetch
>>> >     http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
>>> > /gundem/gundemdetay/18.11.2012/1628740/default.htm
>>> >     200     10632
>>> >     5570
>>> > 11-18-2012 12:58:52.533     robots parse
>>> >     www.hurriyet.com.tr
>>> >     SUCCESS     0
>>> >     78
>>> > 11-18-2012 12:58:52.511     robots parse
>>> >     gundem.milliyet.com.tr
>>> >     SUCCESS     0
>>> >     70
>>> > 11-18-2012 12:58:52.136     fetch
>>> >     http://www.hurriyet.com.tr/robots.txt
>>> >     200     928
>>> >     476
>>> > 11-18-2012 12:58:52.129     fetch
>>> >     http://gundem.milliyet.com.tr/robots.txt
>>> >     200     797
>>> >     453
>>> > 11-18-2012 12:58:49.013     fetch
>>> >     http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>> >     200     34467
>>> >     1080
>>> > 11-18-2012 12:58:48.993     fetch
>>> >     http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>>> >     200     72439
>>> >     1510
>>> > 11-18-2012 12:58:44.513     robots parse
>>> >     www.milliyet.com.tr
>>> >     SUCCESS     0
>>> >     340
>>> > 11-18-2012 12:58:44.013     fetch
>>> >     http://rss.hurriyet.com.tr/robots.txt
>>> >     404     4096
>>> >     770
>>> > 11-18-2012 12:58:44.013     fetch
>>> >     http://www.milliyet.com.tr/robots.txt
>>> >     200     17484
>>> >     840
>>> > 11-18-2012 12:58:41.502     job start
>>> >     1353261469661(rss)
>>> >         0     1
>>> >
>>> > <<<<<<
>>> >
>>> > So it looks like there's a http://www.milliyet.com.tr/robots.txt that
>>> > it fetched fine, and there is no
>>> > http://rss.hurriyet.com.tr/robots.txt.  Does this
>>> > seem correct to you?
>>> >  Furthermore, there is content that the feed points at that
>>> > requires
>>> > access to (and robots fetches for) two other servers...
>>> >
>>> > Karl
>>> >
>>> > On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <daddywri@gmail.com>
>>> > wrote:
>>> > > Odd. The problem is obviously the port of -1. But the
>>> > code does not
>>> > > attach a specific port to the URL in that case.
>>> > >
>>> > > I will try your example exactly when I have access to
>>> > internet again.
>>> > >
>>> > > Karl
>>> > >
>>> > > Sent from my Windows Phone
>>> > > From: Ahmet Arslan
>>> > > Sent: 11/17/2012 4:47 PM
>>> > > To: dev@manifoldcf.apache.org
>>> > > Subject: Re: Anyone out there using RSS connector, who
>>> > wants to help?
>>> > > Hi,
>>> > >
>>> > > Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1"
>>> > >
>>> > > I see that http://www.milliyet.com.tr/robots.txt exists.
>>> > >
>>> > > Ahmet
>>> > >
>>> > > --- On Sat, 11/17/12, Ahmet Arslan <iorixxx@yahoo.com>
>>> > wrote:
>>> > >
>>> > >> From: Ahmet Arslan <iorixxx@yahoo.com>
>>> > >> Subject: Re: Anyone out there using RSS connector,
>>> > who wants to help?
>>> > >> To: dev@manifoldcf.apache.org
>>> > >> Date: Saturday, November 17, 2012, 11:11 PM
>>> > >> Hi Karl,
>>> > >>
>>> > >> Never used rss connector. But here is what I have
>>> > done.
>>> > >>
>>> > >> I defined a job to crawl using mcf-trunk. mcf-trunk crawled the
>>> > >> following two URLs:
>>> > >>
>>> > >> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>>> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>> > >>
>>> > >> With CONNECTORS-120 branch I can crawl
>>> > >>
>>> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>> > >>
>>> > >> but  http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
>>> > >> status of "Error: Repeated service interruptions -
>>> > failure
>>> > >> getting document version"
>>> > >>
>>> > >> I see these in the log file :
>>> > >>
>>> > >>  WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
>>> > >> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
>>> > >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
>>> > >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>> > >>  WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
>>> > >> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
>>> > >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
>>> > >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>> > >>
>>> > >>
>>> > >> By the way in "Dechromed Content" tab (Job Setting
>>> > UI) I see
>>> > >> four " "
>>> > >>
>>> > >> Thanks,
>>> > >> Ahmet
>>> > >> --- On Fri, 11/16/12, Karl Wright <daddywri@gmail.com>
>>> > >> wrote:
>>> > >>
>>> > >> > From: Karl Wright <daddywri@gmail.com>
>>> > >> > Subject: Anyone out there using RSS connector,
>>> > who
>>> > >> wants to help?
>>> > >> > To: "dev" <dev@manifoldcf.apache.org>
>>> > >> > Date: Friday, November 16, 2012, 3:54 PM
>>> > >> > Hi all,
>>> > >> >
>>> > >> > The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
>>> > >> > contains an RSS connector that has been
>>> > updated to use
>>> > >> > httpcomponents
>>> > >> > 4.2.2.  I'd love for people who are in a
>>> > position to
>>> > >> do
>>> > >> > significant
>>> > >> > RSS crawling to try it out before I pull it
>>> > into
>>> > >> > trunk.  Any takers?
>>> > >> >
>>> > >> > Karl
>>> > >> >
>>> > >>
>>> >
>>>
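
A note on the "-1" port seen in the warning quoted above: in Java, `java.net.URI.getPort()` (like `java.net.URL.getPort()`) returns -1 when the URL carries no explicit port, so a "host:-1" string is what you get if that value is appended without a check. The sketch below shows one way to guard against it. It is an illustration only; `RobotsUrlSketch` and `robotsUrlFor` are hypothetical names, the `:8080` URL is a made-up example, and none of this is taken from the ManifoldCF RSS connector source.

```java
import java.net.URI;

public class RobotsUrlSketch {
    // Hypothetical helper: derives the robots.txt URL for the server that
    // hosts pageUrl, appending the port only when one was given explicitly.
    static String robotsUrlFor(String pageUrl) {
        URI uri = URI.create(pageUrl);
        int port = uri.getPort(); // -1 means "no port in the URL"
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getHost());
        if (port != -1) {
            sb.append(':').append(port);
        }
        return sb.append("/robots.txt").toString();
    }

    public static void main(String[] args) {
        // Feed URL from the thread, which has no explicit port.
        System.out.println(robotsUrlFor("http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml"));
        // Made-up URL with an explicit port, for contrast.
        System.out.println(robotsUrlFor("http://rss.hurriyet.com.tr:8080/rss.aspx?sectionId=2"));
    }
}
```

Because robots.txt is looked up per host, a feed whose items live on other servers (as with gundem.milliyet.com.tr in the history above) triggers one such lookup for each distinct host.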
