manifoldcf-dev mailing list archives

From Erlend Garåsen
Subject Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
Date Thu, 18 Sep 2014 12:16:10 GMT

I tried to fetch documents by using curl from our prod server just in 
case a webmaster had blocked access. No problem. Maybe I should ask the 
webmaster of that host anyway, just to be sure.

The "Interrupted" message may have been caused by aborting that job.

I think I should just stop the problematic job and start the three 
remaining jobs instead. I bet they will all complete. Ideally we 
shouldn't crawl at all since it's a DSpace resource. I 
have just contacted someone who is indexing DSpace resources. I guess a 
DSpace connector is a better approach.

Below you'll find some parameters.

Throttling -> max connections: 30
Throttling -> Max fetches/min: 100
Bandwidth -> max connections: 25
Bandwidth -> max kbytes/sec: 8000
Bandwidth -> max fetches/min: 20
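
As a rough sanity check on the two fetches/min limits above, here is a minimal Java sketch that converts them into an average spacing between fetches. The class and method names are hypothetical and not part of ManifoldCF, and the idea that the lower of the two limits is the one that applies to a host is my assumption, not something confirmed in this thread.

```java
// Hypothetical helper, not part of ManifoldCF: converts the fetch-rate
// limits listed above into an average spacing between fetches.
class ThrottleSketch {
    // Average minimum interval between two fetches, in milliseconds.
    static long minIntervalMillis(int maxFetchesPerMinute) {
        if (maxFetchesPerMinute <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return 60_000L / maxFetchesPerMinute;
    }

    // Assumption: when two limits cover the same host, the lower one wins.
    static int effectiveFetchesPerMinute(int firstLimit, int secondLimit) {
        return Math.min(firstLimit, secondLimit);
    }

    public static void main(String[] args) {
        // 100 and 20 are the two fetches/min values from the parameters above.
        int effective = effectiveFetchesPerMinute(100, 20);
        System.out.println("effective fetches/min: " + effective);
        System.out.println("min interval (ms): " + minIntervalMillis(effective));
    }
}
```

Under that assumption the settings allow roughly one fetch every three seconds on average, which does not by itself rule out several fetches being released within a few milliseconds of each other.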


Hop filters: Keep forever


Exclude from crawl:
# Exclude some file types:

# TIKA-1011

# Exclude log files:

# In general, indexing of DUO search results is not allowed:

# Other elements in DUO to be excluded:

# Skip locale settings - makes duplicates:

# Temporarily skip PDFs since we are indexing abstracts:

# skip full item record:
# new URL structure:

# Skip all navigations but "start with letter":

# Skip search:
# new URL structure:

# Skip statistics:

Exclude from index:
# Exclude front page - no valuable info and we have QL:

# Do not index navigation, but follow:
# new URL structure:

# Exclude IDs lower than four, probably category listings:
# new URL structure:

Thanks for looking at this!

BTW: Within an hour, I will be away from my computer and cannot test 
anymore until Monday. I'm leaving Oslo for some days, but I will still 
be able to read and answer emails.


On 18.09.14 13:43, Karl Wright wrote:
> Hi Erlend,
> The "Interrupted: null" message with a -104 code means only that the fetch
> was interrupted by something.  Unfortunately, the message is not clear
> about what the cause of the interruption is.  This is unrelated to
> Zookeeper; but I agree that it is suspicious that many such interruptions
> appear right after robots is parsed.
> One cause of a -104 is when the target server forcibly drops the
> connection, so an InterruptedIOException is thrown.  Having a look at the
> timestamps for the fetch messages, it looks believable that you might have
> exceeded some predetermined limit on that machine.  They're all within a
> few milliseconds of each other.  When a robots file needs to be read,
> ManifoldCF creates an event for that, and the urls blocked by that event
> will all be 'fetchable' as soon as the event is released.  Perhaps your
> throttling needs to be adjusted now that the rate limit bug has been fixed?
> I won't be able to work with this without at least your crawling parameters
> for the server in question.  I can ping that server so if you would like I
> can try crawling that server from here.
> For zookeeper, I would still try to either increase your tick count to
> maybe 10000, or better yet, find out why you periodically lose the ability
> to transmit pings from MCF to your zookeeper process.
> Thanks,
> Karl
> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen wrote:
>> On 18.09.14 13:00, Karl Wright wrote:
>>> Hi Erlend,
>>> please can you also add the manifoldcf log as well?
>> Yes, I will, but it includes entries from RC0 as well.
>> MCF works perfectly using the other jobs for the other hosts. Take a look
>> at the following once again. MCF is being interrupted:
>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|
>> 4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|
>> <>
>> Interrupted: Interrupted: null
>> You can find this entry near the other regarding the robots.txt file:
>> Erlend
