manifoldcf-dev mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
Date Thu, 18 Sep 2014 12:48:42 GMT
Thanks -- I will probably not be able to get to this further until tonight
anyhow.

Karl

On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen <e.f.garasen@usit.uio.no>
wrote:

>
> I tried to fetch documents by using curl from our prod server just in case
> a webmaster had blocked access. No problem. Maybe I should ask the
> webmaster of that host anyway, just to be sure.
>
> The interrupted message may have been caused by an abort of that job.
>
> I think I should just stop the problematic job and start all the other
> three remaining jobs instead. I bet they will all complete. Ideally we
> shouldn't crawl www.duo.uio.no at all since it's a DSpace resource. I
> have just contacted someone who is indexing DSpace resources. I guess a
> DSpace connector is a better approach.
>
> Below you'll find some parameters.
>
> REPOSITORY CONNECTION
> ---------------------
> Throttling -> max connections: 30
> Throttling -> Max fetches/min: 100
> Bandwidth -> max connections: 25
> Bandwidth -> max kbytes/sec: 8000
> Bandwidth -> max fetches/min: 20
>
> JOB SETTINGS
> ------------
>
> Hop filters: Keep forever
>
> Seeds: https://www.duo.uio.no/
>
> Exclude from crawl:
> # Exclude some file types:
> \.gif$
> \.GIF$
> \.jpeg$
> \.JPEG$
> \.jpg$
> \.JPG$
> \.png$
> \.PNG$
> \.mpg$
> \.MPG$
> \.mpeg$
> \.MPEG$
> \.exe$
> \.bmp$
> \.BMP$
> \.mov$
> \.MOV$
> \.wmf$
> \.css$
> \.ico$
> \.ICO$
> \.mp2$
> \.mp3$
> \.mp4$
> \.wmv$
> \.tif$
> \.tiff$
> \.avi$
> \.ogg$
> \.ogv$
> \.zip$
> \.gz$
> \.psd$
>
> # TIKA-1011
> \.mhtml$
>
> # Exclude log files:
> \.log$
> \.logfile$
>
> # In general, indexing of DUO search results is not allowed:
> https?://www\.duo\.uio\.no/sok/search.*
>
> # Other elements in DUO that should be excluded:
> https://www\.duo\.uio\.no.*open-search/description\.xml$
> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
>
> # Skip locale settings - makes duplicates:
> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>
> # Temporarily skip PDFs since we are indexing abstracts:
> https://www\.duo\.uio\.no/bitstream/handle/.+
>
> # skip full item record:
> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
> # new URL structure:
> https://www\.duo\.uio\.no/handle/.*\?show=full$
>
> # Skip all navigations but "start with letter":
> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>
> # Skip search:
> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
> # new URL structure:
> https://www\.duo\.uio\.no/discover\?.*
> https://www\.duo\.uio\.no/search-filter\?.*
>
> # Skip statistics:
> https://www\.duo\.uio\.no/handle/.*/statistics$
>
> Exclude from index:
> # Exclude front page - no valuable info and we have QL:
> https?://www\.duo\.uio\.no/$
>
> # Do not index navigation, but follow:
> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
> # new URL structure:
> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>
> # Exclude IDs of up to four digits, probably category listings:
> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
> # new URL structure:
> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
>
> Thanks for looking at this!
>
> BTW: Within an hour, I will be away from my computer and cannot test
> anymore until Monday. I'm leaving Oslo for some days, but I will still be
> able to read and answer emails.
>
> Erlend
>
>
> On 18.09.14 13:43, Karl Wright wrote:
>
>> Hi Erlend,
>>
>> The "Interrupted: null" message with a -104 code means only that the fetch
>> was interrupted by something.  Unfortunately, the message is not clear
>> about what the cause of the interruption is.  This is unrelated to
>> Zookeeper; but I agree that it is suspicious that many such interruptions
>> appear right after robots is parsed.
>>
>> One cause of a -104 is when the target server forcibly drops the
>> connection, so an InterruptedIOException is thrown.  Having a look at the
>> timestamps for the fetch messages, it looks believable that you might have
>> exceeded some predetermined limit on that machine.  They're all within a
>> few milliseconds of each other.  When a robots file needs to be read,
>> ManifoldCF creates an event for that, and the urls blocked by that event
>> will all be 'fetchable' as soon as the event is released.  Perhaps your
>> throttling needs to be adjusted now that the rate limit bug has been
>> fixed?
>>
>> I won't be able to work with this without at least your crawling
>> parameters for the server in question.  I can ping that server so if you
>> would like I can try crawling that server from here.
>>
>> For zookeeper, I would still try to either increase your tick count to
>> maybe 10000, or better yet, find out why you periodically lose the ability
>> to transmit pings from MCF to your zookeeper process.
>>
>> Thanks,
>> Karl
>>
>>
>>
>>
>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen <e.f.garasen@usit.uio.no>
>> wrote:
>>
>>> On 18.09.14 13:00, Karl Wright wrote:
>>>
>>>> Hi Erlend,
>>>>
>>>> please can you also add the manifoldcf log as well?
>>>>
>>>
>>> Yes, I will, but it includes entries from RC0 as well.
>>>
>>> MCF works perfectly using the other jobs for the other hosts. Take a look
>>> at the following once again. MCF is being interrupted:
>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Interrupted: null
>>>
>>> You can find this entry near the other regarding the robots.txt file:
>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>
>>> Erlend
>>>
>>>
>>>
>>
>
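
One way to sanity-check exclusion patterns like the ones quoted above,
before restarting a job, is to run them against a few sample URLs. The
sketch below copies a handful of the quoted patterns verbatim. It makes
two assumptions: that a pattern excludes a URL when it matches anywhere
in the URL (hence re.search), and that Python's regex dialect behaves
like ManifoldCF's Java regexes for these simple patterns. The sample
URLs and the helper name is_excluded are made up for illustration.

```python
import re

# A few of the "Exclude from crawl" patterns quoted in this thread, verbatim.
EXCLUDE_PATTERNS = [
    r"\.gif$",
    r"\.log$",
    r"https?://www\.duo\.uio\.no/sok/search.*",
    r"https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$",
    r"https://www\.duo\.uio\.no/bitstream/handle/.+",
    r"https://www\.duo\.uio\.no/handle/.*\?show=full$",
    r"https://www\.duo\.uio\.no/discover\?.*",
]


def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if any pattern matches somewhere in the URL.

    Assumption: "match anywhere" semantics, approximated with re.search.
    """
    return any(re.search(p, url) for p in patterns)


# Hypothetical sample URLs, invented for this check.
SAMPLES = [
    "https://www.duo.uio.no/handle/10852/12345",
    "https://www.duo.uio.no/handle/10852/12345?show=full",
    "https://www.duo.uio.no/bitstream/handle/10852/12345/thesis.pdf",
    "https://www.duo.uio.no/discover?query=test",
    "https://www.duo.uio.no/?locale-attribute=en",
]

if __name__ == "__main__":
    for url in SAMPLES:
        label = "EXCLUDED" if is_excluded(url) else "kept    "
        print(label, url)
```

Only the first sample URL should come out as kept. Results from the
real crawler may differ if its matching rules are not what is assumed
here.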

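To see whether the -104 results really cluster within a few
milliseconds of each other, the FETCH URL entries can be pulled out of
manifoldcf.log with a small script. The sketch below splits the entry
quoted in this thread on its pipe separators. The field names are
guesses, not taken from ManifoldCF documentation; only the URL, the
-104 result code and the exception class are evident from the entry
itself. The function name parse_fetch_line is hypothetical.

```python
from datetime import datetime

# The log entry quoted in this thread, rejoined onto one line.
LINE = (
    "INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|"
    "https://www.duo.uio.no/|1411030940209+682605|-104|4096|"
    "org.apache.manifoldcf.core.interfaces.ManifoldCFException|"
    "Interrupted: Interrupted: null"
)

MARKER = " - WEB: FETCH URL|"


def parse_fetch_line(line):
    """Split one FETCH URL log line into a dict, or return None if it does not fit."""
    prefix, sep, rest = line.partition(MARKER)
    if not sep:
        return None
    parts = rest.split("|", 5)
    if len(parts) != 6:
        return None
    url, timing, code, extra, exception, message = parts
    tokens = prefix.split()
    if len(tokens) < 3:
        return None
    # tokens[1] is the date and tokens[2] the time with milliseconds.
    timestamp = datetime.strptime(
        tokens[1] + " " + tokens[2], "%Y-%m-%d %H:%M:%S,%f"
    )
    return {
        "timestamp": timestamp,
        "url": url,
        "timing": timing,        # meaning assumed, kept as raw text
        "code": int(code),       # the result code, -104 in this entry
        "extra": extra,          # meaning unknown, kept as raw text
        "exception": exception,
        "message": message.strip(),
    }


if __name__ == "__main__":
    entry = parse_fetch_line(LINE)
    print(entry["timestamp"], entry["code"], entry["url"])
```

Running this over every line of the log and printing only the entries
with code -104, sorted by timestamp, would show how tightly the
interrupted fetches are grouped.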