manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
Date Thu, 18 Sep 2014 20:35:25 GMT
Ok, I started this crawl.  It fetched and processed robots.txt perfectly.
And then I saw the following: lots of fetches of fairly good-sized
documents, with very few ingestions.  The documents that did not ingest
look like this:

https://www.duo.uio.no/handle/10852/163/discover?order=DESC&r...pp=100&sort_by=dc.date.issued_dt


I think your index inclusion rules may be excluding most of the content.
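
To make that concrete, here is a minimal sketch that runs the sample URL against a subset of the rules from the job settings quoted below. It assumes the rules are applied with "find anywhere in the URL" semantics, modelled here with Python's re.search. That is an assumption of this sketch, not something confirmed in this thread, and the helper name matching_rules is made up for illustration.

```python
import re

# Rules copied from the quoted job settings (a subset, for brevity).
EXCLUDE_FROM_CRAWL = [
    r"https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|"
    r"community-list|browse|password-login|inn|discover).*",
    r"https://www\.duo\.uio\.no/handle/.*search-filter\?.*",
    r"https://www\.duo\.uio\.no/discover\?.*",
    r"https://www\.duo\.uio\.no/search-filter\?.*",
    r"https://www\.duo\.uio\.no/handle/.*/statistics$",
]
EXCLUDE_FROM_INDEX = [
    r"https?://www\.duo\.uio\.no/$",
    r"https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+",
    r"https://www\.duo\.uio\.no/handle/\d+/\d+/.+",
    r"https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$",
    r"https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$",
]


def matching_rules(url, rules):
    """Return the rules that match somewhere in the URL (assumed semantics)."""
    return [rule for rule in rules if re.search(rule, url)]


# URL exactly as shown above, including the "..." where it was truncated.
url = ("https://www.duo.uio.no/handle/10852/163/discover"
       "?order=DESC&r...pp=100&sort_by=dc.date.issued_dt")

crawl_hits = matching_rules(url, EXCLUDE_FROM_CRAWL)
index_hits = matching_rules(url, EXCLUDE_FROM_INDEX)
print("excluded from crawl by:", crawl_hits)
print("excluded from index by:", index_hits)
```

Under those assumptions no crawl rule matches (the handle/.../discover rule is commented out in the settings), so the page is fetched, while the handle/\d+/\d+/.+ index rule matches, so it is not ingested.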

Karl



On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright <daddywri@gmail.com> wrote:

> Thanks -- I will probably not be able to get to this further until tonight
> anyhow.
>
> Karl
>
> On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen <e.f.garasen@usit.uio.no>
> wrote:
>
>>
>> I tried to fetch documents by using curl from our prod server just in
>> case a webmaster had blocked access. No problem. Maybe I should ask the
>> webmaster of that host anyway, just to be sure.
>>
>> The interrupted message may have been caused by an abort of that job.
>>
>> I think I should just stop the problematic job and start all the other
>> three remaining jobs instead. I bet they will all complete. Ideally we
>> shouldn't crawl www.duo.uio.no at all since it's a DSpace resource. I
>> have just contacted someone who is indexing DSpace resources. I guess a
>> DSpace connector is a better approach.
>>
>> Below you'll find some parameters.
>>
>> REPOSITORY CONNECTION
>> ---------------------
>> Throttling -> max connections: 30
>> Throttling -> Max fetches/min: 100
>> Bandwidth -> max connections: 25
>> Bandwidth -> max kbytes/sec: 8000
>> Bandwidth -> max fetches/min: 20
>>
>> JOB SETTINGS
>> ------------
>>
>> Hop filters: Keep forever
>>
>> Seeds: https://www.duo.uio.no/
>>
>> Exclude from crawl:
>> # Exclude some file types:
>> \.gif$
>> \.GIF$
>> \.jpeg$
>> \.JPEG$
>> \.jpg$
>> \.JPG$
>> \.png$
>> \.PNG$
>> \.mpg$
>> \.MPG$
>> \.mpeg$
>> \.MPEG$
>> \.exe$
>> \.bmp$
>> \.BMP$
>> \.mov$
>> \.MOV$
>> \.wmf$
>> \.css$
>> \.ico$
>> \.ICO$
>> \.mp2$
>> \.mp3$
>> \.mp4$
>> \.wmv$
>> \.tif$
>> \.tiff$
>> \.avi$
>> \.ogg$
>> \.ogv$
>> \.zip$
>> \.gz$
>> \.psd$
>>
>> # TIKA-1011
>> \.mhtml$
>>
>> # Exclude log files:
>> \.log$
>> \.logfile$
>>
>> # In general, indexing of DUO search results is not allowed:
>> https?://www\.duo\.uio\.no/sok/search.*
>>
>> # Other elements in DUO to be excluded:
>> https://www\.duo\.uio\.no.*open-search/description\.xml$
>> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
>>
>> # Skip locale settings - makes duplicates:
>> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>>
>> # Temporarily skip PDFs since we are indexing abstracts:
>> https://www\.duo\.uio\.no/bitstream/handle/.+
>>
>> # skip full item record:
>> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
>> # new URL structure:
>> https://www\.duo\.uio\.no/handle/.*\?show=full$
>>
>> # Skip all navigations but "start with letter":
>> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>>
>> # Skip search:
>> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
>> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
>> # new URL structure:
>> https://www\.duo\.uio\.no/discover\?.*
>> https://www\.duo\.uio\.no/search-filter\?.*
>>
>> # Skip statistics:
>> https://www\.duo\.uio\.no/handle/.*/statistics$
>>
>> Exclude from index:
>> # Exclude front page - no valuable info and we have QL:
>> https?://www\.duo\.uio\.no/$
>>
>> # Do not index navigation, but follow:
>> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
>> # new URL structure:
>> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>>
>> # Exclude IDs lower than four, probably category listings:
>> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
>> # new URL structure:
>> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
>>
>> Thanks for looking at this!
>>
>> BTW: Within an hour, I will be away from my computer and cannot test
>> anymore until Monday. I'm leaving Oslo for some days, but I will still be
>> able to read and answer emails.
>>
>> Erlend
>>
>>
>> On 18.09.14 13:43, Karl Wright wrote:
>>
>>> Hi Erlend,
>>>
>>> The "Interrupted: null" message with a -104 code means only that the fetch
>>> was interrupted by something.  Unfortunately, the message does not say
>>> what caused the interruption.  This is unrelated to Zookeeper, but I agree
>>> that it is suspicious that many such interruptions appear right after
>>> robots.txt is parsed.
>>>
>>> One cause of a -104 is the target server forcibly dropping the
>>> connection, so that an InterruptedIOException is thrown.  Looking at the
>>> timestamps of the fetch messages, it is plausible that you exceeded some
>>> predetermined limit on that machine: they are all within a few
>>> milliseconds of each other.  When a robots file needs to be read,
>>> ManifoldCF creates an event for that, and the URLs blocked by that event
>>> all become 'fetchable' as soon as the event is released.  Perhaps your
>>> throttling needs to be adjusted now that the rate limit bug has been
>>> fixed?
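
The burst described above can be illustrated with a small sketch. It is not ManifoldCF's throttling algorithm, only a toy model: a batch of URLs all becomes eligible at one instant, and a "max fetches per minute" limit spreads their start times out. The function name schedule_fetches and the release time of 1000 ms are made up for illustration. The figure of 20 fetches/min is taken from the bandwidth settings quoted elsewhere in this thread.

```python
def schedule_fetches(release_time_ms, url_count, max_fetches_per_min):
    """Return start times (ms) for url_count fetches that all become
    eligible at release_time_ms, spaced to respect max_fetches_per_min."""
    interval_ms = 60_000 / max_fetches_per_min
    return [release_time_ms + round(i * interval_ms) for i in range(url_count)]


# Without any throttling, every blocked URL would start at the release time.
unthrottled = [1000] * 5

# With a limit of 20 fetches/min, consecutive fetches start 3 seconds apart.
throttled = schedule_fetches(1000, 5, 20)

print(unthrottled)
print(throttled)
```

In this model the unthrottled batch hits the server with five requests at the same millisecond, while the throttled batch sends one request every three seconds.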
>>>
>>> I won't be able to work on this without at least your crawling parameters
>>> for the server in question.  I can ping that server, so if you would like,
>>> I can try crawling it from here.
>>>
>>> For Zookeeper, I would still try either to increase your tick count to
>>> maybe 10000 or, better yet, to find out why you periodically lose the
>>> ability to transmit pings from MCF to your Zookeeper process.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>>
>>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:
>>>
>>>  On 18.09.14 13:00, Karl Wright wrote:
>>>>
>>>>  Hi Erlend,
>>>>>
>>>>> please can you also add the manifoldcf log as well?
>>>>>
>>>>>
>>>> Yes, I will, but it includes entries from RC0 as well.
>>>>
>>>> MCF works perfectly using the other jobs for the other hosts. Take a look
>>>> at the following once again. MCF is being interrupted:
>>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Interrupted: null
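
For anyone scanning the log for more of these entries, here is a rough sketch of splitting such a "WEB: FETCH" line into its pipe-separated parts. The field names in the returned dictionary are guesses made for this sketch, based only on the one entry quoted above. They are not names defined by ManifoldCF, and parse_fetch_line is a hypothetical helper.

```python
def parse_fetch_line(line):
    """Split a 'WEB: FETCH' log entry into its pipe-separated fields.

    The field names are this sketch's guesses, not ManifoldCF terminology.
    """
    _, _, rest = line.partition(" - WEB: FETCH ")
    # Limit the split so that any '|' inside the final message is preserved.
    kind, url, timing, code, size, exc_class, message = rest.split("|", 6)
    start, _, elapsed = timing.partition("+")
    return {
        "kind": kind,
        "url": url,
        "start": int(start),        # looks like a start timestamp
        "elapsed": int(elapsed),    # looks like an elapsed time
        "result_code": int(code),   # -104 is the code discussed in this thread
        "size": int(size),
        "exception": exc_class,
        "message": message.strip(),
    }


line = (
    "INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH "
    "URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|"
    "org.apache.manifoldcf.core.interfaces.ManifoldCFException|"
    "Interrupted: Interrupted: null"
)
record = parse_fetch_line(line)
print(record["url"], record["result_code"], record["message"])
```

Filtering the parsed records on result_code == -104 would then list every interrupted fetch in the log.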
>>>>
>>>> You can find this entry near the other regarding the robots.txt file:
>>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>>
>>>> Erlend
>>>>
>>>>
>>>>
>>>
>>
>
