manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Anyone out there using RSS connector, who wants to help?
Date Sun, 18 Nov 2012 20:11:38 GMT
The CONNECTORS-120 branch now also has an httpcomponents version of the
wiki connector implemented.  I think Maciej might be interested in
trying that one out.

Karl

On Sun, Nov 18, 2012 at 1:04 PM, Karl Wright <daddywri@gmail.com> wrote:
> Hi Ahmet,
>
> I tried your example, but it looked like it worked fine here.  Here's
> part of the simple history:
>
>>>>>>>
> 11-18-2012 12:59:52.182         document ingest (null)
>         http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
> ndem/gundemdetay/18.11.2012/1628733/default.htm
>         OK      16307   1
> 11-18-2012 12:59:47.482         document ingest (null)
>         http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
> gundemdetay/18.11.2012/1628657/default.htm
>         OK      10573   1
> 11-18-2012 12:59:47.133         fetch
>         http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
> ndem/gundemdetay/18.11.2012/1628733/default.htm
>         200     16307   5050
> 11-18-2012 12:59:42.133         fetch
>         http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
> gundemdetay/18.11.2012/1628657/default.htm
>         200     10573   5340
> 11-18-2012 12:59:42.092         document ingest (null)
>         http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
> sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
>         OK      10212   1
> 11-18-2012 12:59:37.252         document ingest (null)
>         http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
> nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
>         OK      16105   1
> 11-18-2012 12:59:37.133         fetch
>         http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
> sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
>         200     10212   4950
> 11-18-2012 12:59:32.332         document ingest (null)
>         http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
> m/gundemdetay/18.11.2012/1628801/default.htm
>         OK      10170   1
> 11-18-2012 12:59:32.133         fetch
>         http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
> nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
>         200     16105   5110
> 11-18-2012 12:59:27.142         document ingest (null)
>         http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
> ndemdetay/18.11.2012/1628661/default.htm
>         OK      10102   1
> 11-18-2012 12:59:27.133         fetch
>         http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
> m/gundemdetay/18.11.2012/1628801/default.htm
>         200     10170   5200
> 11-18-2012 12:59:22.182         document ingest (null)
>         http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
> gundemdetay/18.11.2012/1628824/default.htm
>         OK      10217   1
> 11-18-2012 12:59:22.133         fetch
>         http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
> ndemdetay/18.11.2012/1628661/default.htm
>         200     10102   4990
> 11-18-2012 12:59:18.062         document ingest (null)
>         http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
> /gundemdetay/18.11.2012/1628856/default.htm
>         OK      9721    1
> 11-18-2012 12:59:17.133         fetch
>         http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
> gundemdetay/18.11.2012/1628824/default.htm
>         200     10217   5050
> 11-18-2012 12:59:12.452         document ingest (null)
>         http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
> etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
>         OK      11412   1
> 11-18-2012 12:59:12.133         fetch
>         http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
> /gundemdetay/18.11.2012/1628856/default.htm
>         200     9721    5930
> 11-18-2012 12:59:07.133         fetch
>         http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
> etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
>         200     11412   5300
> 11-18-2012 12:59:06.892         document ingest (null)
>         http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
> gundemdetay/17.11.2012/1628402/default.htm
>         OK      11183   1
> 11-18-2012 12:59:02.772         document ingest (null)
>         http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
> /gundem/gundemdetay/18.11.2012/1628740/default.htm
>         OK      10632   1
> 11-18-2012 12:59:02.153         fetch
>         http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
> gundemdetay/17.11.2012/1628402/default.htm
>         200     11183   4720
> 11-18-2012 12:58:57.173         fetch
>         http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
> /gundem/gundemdetay/18.11.2012/1628740/default.htm
>         200     10632   5570
> 11-18-2012 12:58:52.533         robots parse    www.hurriyet.com.tr
>         SUCCESS         0       78
> 11-18-2012 12:58:52.511         robots parse    gundem.milliyet.com.tr
>         SUCCESS         0       70
> 11-18-2012 12:58:52.136         fetch   http://www.hurriyet.com.tr/robots.txt
>         200     928     476
> 11-18-2012 12:58:52.129         fetch   http://gundem.milliyet.com.tr/robots.txt
>         200     797     453
> 11-18-2012 12:58:49.013         fetch   http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>         200     34467   1080
> 11-18-2012 12:58:48.993         fetch   http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>         200     72439   1510
> 11-18-2012 12:58:44.513         robots parse    www.milliyet.com.tr
>         SUCCESS         0       340
> 11-18-2012 12:58:44.013         fetch   http://rss.hurriyet.com.tr/robots.txt
>         404     4096    770
> 11-18-2012 12:58:44.013         fetch   http://www.milliyet.com.tr/robots.txt
>         200     17484   840
> 11-18-2012 12:58:41.502         job start       1353261469661(rss)
>                 0       1
> <<<<<<
>
> So it looks like there's a http://www.milliyet.com.tr/robots.txt that
> it fetched fine, and there is no
> http://rss.hurriyet.com.tr/robots.txt.  Does this seem correct to you?
>  Furthermore, there is content that the feed points at that requires
> access to (and robots fetches for) two other servers...
>
> Karl
>
> On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <daddywri@gmail.com> wrote:
>> Odd. The problem is obviously the port of -1. But the code does not
>> attach a specific port to the URL in that case.
>>
>> I will try your example exactly when I have access to internet again.
>>
>> Karl
>>
>> Sent from my Windows Phone
>> From: Ahmet Arslan
>> Sent: 11/17/2012 4:47 PM
>> To: dev@manifoldcf.apache.org
>> Subject: Re: Anyone out there using RSS connector, who wants to help?
>> Hi,
>>
>> Regarding  "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
>> Pre-ingest service interruption reported for job 1353185325276
>> connection 'rss': Couldn't fetch robots.txt from
>> http://www.milliyet.com.tr:-1"
>>
>> I see that http://www.milliyet.com.tr/robots.txt exists.
>>
>> Ahmet
>>
>> --- On Sat, 11/17/12, Ahmet Arslan <iorixxx@yahoo.com> wrote:
>>
>>> From: Ahmet Arslan <iorixxx@yahoo.com>
>>> Subject: Re: Anyone out there using RSS connector, who wants to help?
>>> To: dev@manifoldcf.apache.org
>>> Date: Saturday, November 17, 2012, 11:11 PM
>>> Hi Karl,
>>>
>>> Never used rss connector. But here is what I have done.
>>>
>>> I defined a job to crawl using mcf-trunk. mcf-trunk crawled the
>>> following two URLs:
>>>
>>> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>>
>>> With CONNECTORS-120 branch I can crawl
>>>
>>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>>
>>> but  http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
>>> status of "Error: Repeated service interruptions - failure
>>> getting document version"
>>>
>>> I see these in the log file :
>>>
>>>  WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
>>> Pre-ingest service interruption reported for job
>>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>>> from http://www.milliyet.com.tr:-1
>>> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') -
>>> Exception tossed: Repeated service interruptions - failure
>>> getting document version
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>> Repeated service interruptions - failure getting document
>>> version
>>>     at
>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>>  WARN 2012-11-17 23:02:27,307 (Worker thread '30') -
>>> Pre-ingest service interruption reported for job
>>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>>> from http://www.milliyet.com.tr:-1
>>> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') -
>>> Exception tossed: Repeated service interruptions - failure
>>> getting document version
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>> Repeated service interruptions - failure getting document
>>> version
>>>     at
>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>>
>>>
>>> By the way in "Dechromed Content" tab (Job Setting UI) I see
>>> four "&nbsp;"
>>>
>>> Thanks,
>>> Ahmet
>>> --- On Fri, 11/16/12, Karl Wright <daddywri@gmail.com>
>>> wrote:
>>>
>>> > From: Karl Wright <daddywri@gmail.com>
>>> > Subject: Anyone out there using RSS connector, who
>>> wants to help?
>>> > To: "dev" <dev@manifoldcf.apache.org>
>>> > Date: Friday, November 16, 2012, 3:54 PM
>>> > Hi all,
>>> >
>>> > The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
>>> > contains an RSS connector that has been updated to use
>>> > httpcomponents
>>> > 4.2.2.  I'd love for people who are in a position to
>>> do
>>> > significant
>>> > RSS crawling to try it out before I pull it into
>>> > trunk.  Any takers?
>>> >
>>> > Karl
>>> >
>>>

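A note on the "http://www.milliyet.com.tr:-1" warning quoted above: in Java, java.net.URI.getPort() (like java.net.URL.getPort()) returns -1 when a URL carries no explicit port, so a ":-1" in a logged URL usually suggests that this sentinel was appended without a check. The sketch below only illustrates that behavior and one way to guard against it. It is not the code from the CONNECTORS-120 branch, the thread does not confirm that this is the actual cause, and the class and method names (RobotsUrlSketch, robotsUrl) are hypothetical.

```java
import java.net.URI;

public class RobotsUrlSketch {
    // Hypothetical helper (not ManifoldCF code): derives the robots.txt URL
    // for the server that hosts the given document URL.
    public static String robotsUrl(String documentUrl) {
        URI uri = URI.create(documentUrl);
        int port = uri.getPort(); // -1 means "no explicit port in the URL"

        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getHost());
        // Only append the port when one was actually given; appending the
        // raw value unconditionally would yield "http://host:-1".
        if (port != -1) {
            sb.append(':').append(port);
        }
        sb.append("/robots.txt");
        return sb.toString();
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        System.out.println(URI.create(feed).getPort()); // prints -1
        System.out.println(robotsUrl(feed));            // prints http://www.milliyet.com.tr/robots.txt
    }
}
```

With a guard of this kind, the feed URL from the report maps to http://www.milliyet.com.tr/robots.txt, which matches the robots.txt fetch that succeeded in the simple history above.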