manifoldcf-dev mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
Date Thu, 18 Sep 2014 12:16:10 GMT

I tried fetching documents with curl from our production server, in 
case a webmaster had blocked access. That worked without problems. Maybe 
I should ask the webmaster of that host anyway, just to be sure.

The "Interrupted" message may have been caused by aborting that job.

I think I should just stop the problematic job and start the three 
remaining jobs instead. I bet they will all complete. Ideally we 
shouldn't crawl www.duo.uio.no at all, since it's a DSpace resource. I 
have just contacted someone who is indexing DSpace resources. I guess a 
DSpace connector would be a better approach.

Below you'll find some parameters.

REPOSITORY CONNECTION
---------------------
Throttling -> max connections: 30
Throttling -> Max fetches/min: 100
Bandwidth -> max connections: 25
Bandwidth -> max kbytes/sec: 8000
Bandwidth -> max fetches/min: 20
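
To put the per-minute limits above in perspective, they can be converted to an average spacing between fetches. This is only back-of-the-envelope arithmetic: how ManifoldCF actually combines the two fetch limits with the connection limits is not shown here, and the helper name `min_interval_seconds` is made up for illustration.

```python
def min_interval_seconds(fetches_per_minute):
    """Average spacing between fetches implied by a per-minute limit."""
    return 60.0 / fetches_per_minute


# Values copied from the repository connection settings above.
throttling_interval = min_interval_seconds(100)  # Throttling -> Max fetches/min
bandwidth_interval = min_interval_seconds(20)    # Bandwidth -> max fetches/min

# Assumption: if both limits apply to the same host, the larger spacing governs.
stricter_interval = max(throttling_interval, bandwidth_interval)

print(f"throttling: one fetch every {throttling_interval:.1f} s on average")
print(f"bandwidth:  one fetch every {bandwidth_interval:.1f} s on average")
print(f"stricter:   one fetch every {stricter_interval:.1f} s on average")
```

If the stricter limit were the one in effect, fetches against the same host would be spaced seconds apart on average rather than milliseconds.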

JOB SETTINGS
------------

Hop filters: Keep forever

Seeds: https://www.duo.uio.no/

Exclude from crawl:
# Exclude some file types:
\.gif$
\.GIF$
\.jpeg$
\.JPEG$
\.jpg$
\.JPG$
\.png$
\.PNG$
\.mpg$
\.MPG$
\.mpeg$
\.MPEG$
\.exe$
\.bmp$
\.BMP$
\.mov$
\.MOV$
\.wmf$
\.css$
\.ico$
\.ICO$
\.mp2$
\.mp3$
\.mp4$
\.wmv$
\.tif$
\.tiff$
\.avi$
\.ogg$
\.ogv$
\.zip$
\.gz$
\.psd$

# TIKA-1011
\.mhtml$

# Exclude log files:
\.log$
\.logfile$

# In general, indexing of DUO search results is not allowed:
https?://www\.duo\.uio\.no/sok/search.*

# Other elements in DUO to be excluded:
https://www\.duo\.uio\.no.*open-search/description\.xml$
https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*

# Skip locale settings - makes duplicates:
https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$

# Temporarily skip PDFs since we are indexing abstracts:
https://www\.duo\.uio\.no/bitstream/handle/.+

# Skip full item record:
https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
# New URL structure:
https://www\.duo\.uio\.no/handle/.*\?show=full$

# Skip all navigations but "start with letter":
https://www\.duo\.uio\.no/.*type=(author|dateissued)$

# Skip search:
#https://www\.duo\.uio\.no/handle/.*/discover\?.*
https://www\.duo\.uio\.no/handle/.*search-filter\?.*
# New URL structure:
https://www\.duo\.uio\.no/discover\?.*
https://www\.duo\.uio\.no/search-filter\?.*

# Skip statistics:
https://www\.duo\.uio\.no/handle/.*/statistics$

Exclude from index:
# Exclude front page - no valuable info and we have QL:
https?://www\.duo\.uio\.no/$

# Do not index navigation, but follow:
https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
# New URL structure:
https://www\.duo\.uio\.no/handle/\d+/\d+/.+

# Exclude IDs of up to four digits, probably category listings:
https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
# New URL structure:
https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
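
Rules like the ones above can be checked offline before starting a job. The following is a minimal sketch, not ManifoldCF's own code. It assumes that each non-comment line is a regular expression and that a URL is excluded when any pattern matches somewhere in it. Only a few of the patterns listed above are included; the names `compile_rules` and `is_excluded` and the sample URLs are made up for illustration.

```python
import re

# A handful of the "Exclude from crawl" rules from above, in the same
# one-pattern-per-line format. Lines starting with '#' are comments.
EXCLUDE_FROM_CRAWL = r"""
# Exclude some file types:
\.gif$
\.png$
# Skip locale settings - makes duplicates:
https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
# Temporarily skip PDFs since we are indexing abstracts:
https://www\.duo\.uio\.no/bitstream/handle/.+
# Skip full item record (new URL structure):
https://www\.duo\.uio\.no/handle/.*\?show=full$
"""


def compile_rules(text):
    """Compile every non-empty, non-comment line into a regex."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append(re.compile(line))
    return rules


def is_excluded(url, rules):
    """Assumed semantics: excluded if any pattern matches anywhere in the URL."""
    return any(rule.search(url) for rule in rules)


RULES = compile_rules(EXCLUDE_FROM_CRAWL)

if __name__ == "__main__":
    # Hypothetical sample URLs, for illustration only.
    samples = [
        "https://www.duo.uio.no/handle/10852/12345",
        "https://www.duo.uio.no/handle/10852/12345?show=full",
        "https://www.duo.uio.no/bitstream/handle/10852/12345/thesis.pdf",
        "https://www.duo.uio.no/themes/logo.png",
    ]
    for url in samples:
        print("EXCLUDE" if is_excluded(url, RULES) else "crawl  ", url)
```

Running a list of known URLs through such a check is a quick way to spot a rule that is broader or narrower than intended.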

Thanks for looking at this!

BTW: Within an hour, I will be away from my computer and won't be able 
to test again until Monday. I'm leaving Oslo for a few days, but I will 
still be able to read and answer emails.

Erlend

On 18.09.14 13:43, Karl Wright wrote:
> Hi Erlend,
>
> The "Interrupted: null" message with a -104 code means only that the fetch
> was interrupted by something.  Unfortunately, the message is not clear
> about what the cause of the interruption is.  This is unrelated to
> Zookeeper; but I agree that it is suspicious that many such interruptions
> appear right after robots is parsed.
>
> One cause of a -104 is when the target server forcibly drops the
> connection, so an InterruptedIOException is thrown.  Having a look at the
> timestamps for the fetch messages, it looks believable that you might have
> exceeded some predetermined limit on that machine.  They're all within a
> few milliseconds of each other.  When a robots file needs to be read,
> ManifoldCF creates an event for that, and the urls blocked by that event
> will all be 'fetchable' as soon as the event is released.  Perhaps your
> throttling needs to be adjusted now that the rate limit bug has been fixed?
>
> I won't be able to work with this without at least your crawling parameters
> for the server in question.  I can ping that server so if you would like I
> can try crawling that server from here.
>
> For zookeeper, I would still try to either increase your tick count to
> maybe 10000, or better yet, find out why you periodically lose the ability
> to transmit pings from MCF to your zookeeper process.
>
> Thanks,
> Karl
>
>
>
>
> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen <e.f.garasen@usit.uio.no>
> wrote:
>
>> On 18.09.14 13:00, Karl Wright wrote:
>>
>>> Hi Erlend,
>>>
>>> please can you also add the manifoldcf log as well?
>>>
>>
>> Yes, I will, but it includes entries from RC0 as well.
>>
>> MCF works perfectly using the other jobs for the other hosts. Take a look
>> at the following once again. MCF is being interrupted:
>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException| Interrupted: Interrupted: null
>>
>> You can find this entry near the other regarding the robots.txt file:
>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>
>> Erlend
>>
>>
>

