manifoldcf-dev mailing list archives

From Erlend Fedt Garåsen <e.f.gara...@usit.uio.no>
Subject Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
Date Thu, 18 Sep 2014 21:29:51 GMT
I know. I spent a lot of time creating the rules, which seem to index what we really want.
Your observation is correct. Crawling Dspace repositories is very difficult; there are a lot of
nonsense pages we need to filter out.

We have crawled this host for the last two years using file-based sync.

I'm planning a new approach, i.e. using a connector etc.

E

Sent from my iPhone

> On 18. sep. 2014, at 22:35, "Karl Wright" <daddywri@gmail.com> wrote:
> 
> Ok, I started this crawl.  It fetched and processed robots.txt perfectly.
> And then I saw the following: lots of fetches of fairly good-sized
> documents, with very few ingestions.  The documents that did not ingest
> look like this:
> 
> https://www.duo.uio.no/handle/10852/163/discover?order=DESC&r...pp=100&sort_by=dc.date.issued_dt
> 
> 
> I think your index inclusion rules may be excluding most of the content.
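> 
> As a quick sanity check, you can run the "Exclude from index" rules (quoted
> further down in this thread) against one of those URLs as plain Java regexes.
> This is just a rough sketch, not the connector's actual matching code; it
> assumes the rules behave like ordinary regexes applied to the full URL, and
> the second item URL is made up:
> 
> import java.util.List;
> import java.util.regex.Pattern;
> 
> public class ExclusionCheck {
>     public static void main(String[] args) {
>         // The "Exclude from index" rules from the job, as Java regex strings.
>         List<String> excludeFromIndex = List.of(
>             "https?://www\\.duo\\.uio\\.no/$",
>             "https://www\\.duo\\.uio\\.no/handle/\\d{9}/\\d+/.+",
>             "https://www\\.duo\\.uio\\.no/handle/\\d+/\\d+/.+",
>             "https://www\\.duo\\.uio\\.no/handle/\\d{9}/\\d{1,4}$",
>             "https://www\\.duo\\.uio\\.no/handle/\\d+/\\d{1,3}$");
> 
>         // A shortened form of the discover URL above, plus a made-up item page.
>         String discover =
>             "https://www.duo.uio.no/handle/10852/163/discover?order=DESC";
>         String item = "https://www.duo.uio.no/handle/10852/12345";
> 
>         for (String url : List.of(discover, item)) {
>             boolean excluded = excludeFromIndex.stream()
>                 .anyMatch(r -> Pattern.compile(r).matcher(url).find());
>             System.out.println(url + " -> excluded from index: " + excluded);
>         }
>     }
> }
> 
> If that prints "true" for the discover URL, the \d+/\d+/.+ rule is what keeps
> those pages out of the index even though they are still fetched.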
> 
> Karl
> 
> 
> 
>> On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright <daddywri@gmail.com> wrote:
>> 
>> Thanks -- I will probably not be able to get to this further until tonight
>> anyhow.
>> 
>> Karl
>> 
>> On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen <e.f.garasen@usit.uio.no>
>> wrote:
>> 
>>> 
>>> I tried to fetch documents using curl from our prod server, just in case a
>>> webmaster had blocked access. No problem. Maybe I should ask the webmaster
>>> of that host anyway, just to be sure.
>>> 
>>> The interrupted message may have been caused by an abort of that job.
>>> 
>>> I think I should just stop the problematic job and start the three remaining
>>> jobs instead. I bet they will all complete. Ideally we shouldn't crawl
>>> www.duo.uio.no at all since it's a Dspace resource. I have just contacted
>>> someone who is indexing Dspace resources; I guess a Dspace connector is a
>>> better approach.
>>> 
>>> Below you'll find some parameters.
>>> 
>>> REPOSITORY CONNECTION
>>> ---------------------
>>> Throttling -> max connections: 30
>>> Throttling -> Max fetches/min: 100
>>> Bandwidth -> max connections: 25
>>> Bandwidth -> max kbytes/sec: 8000
>>> Bandwidth -> max fetches/min: 20
>>> 
>>> JOB SETTINGS
>>> ------------
>>> 
>>> Hop filters: Keep forever
>>> 
>>> Seeds: https://www.duo.uio.no/
>>> 
>>> Exclude from crawl:
>>> # Exclude some file types:
>>> \.gif$
>>> \.GIF$
>>> \.jpeg$
>>> \.JPEG$
>>> \.jpg$
>>> \.JPG$
>>> \.png$
>>> \.PNG$
>>> \.mpg$
>>> \.MPG$
>>> \.mpeg$
>>> \.MPEG$
>>> \.exe$
>>> \.bmp$
>>> \.BMP$
>>> \.mov$
>>> \.MOV$
>>> \.wmf$
>>> \.css$
>>> \.ico$
>>> \.ICO$
>>> \.mp2$
>>> \.mp3$
>>> \.mp4$
>>> \.wmv$
>>> \.tif$
>>> \.tiff$
>>> \.avi$
>>> \.ogg$
>>> \.ogv$
>>> \.zip$
>>> \.gz$
>>> \.psd$
>>> 
>>> # TIKA-1011
>>> \.mhtml$
>>> 
>>> # Exclude log files:
>>> \.log$
>>> \.logfile$
>>> 
>>> # In general, do not allow indexing of DUO search results:
>>> https?://www\.duo\.uio\.no/sok/search.*
>>> 
>>> # Other elements in DUO that should be excluded:
>>> https://www\.duo\.uio\.no.*open-search/description\.xml$
>>> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
>>> 
>>> # Skip locale settings - makes duplicates:
>>> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>>> 
>>> # Temporarily skip PDFs since we are indexing abstracts:
>>> https://www\.duo\.uio\.no/bitstream/handle/.+
>>> 
>>> # skip full item record:
>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
>>> # New URL structure:
>>> https://www\.duo\.uio\.no/handle/.*\?show=full$
>>> 
>>> # Skip all navigations but "start with letter":
>>> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>>> 
>>> # Skip search:
>>> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
>>> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
>>> # New URL structure:
>>> https://www\.duo\.uio\.no/discover\?.*
>>> https://www\.duo\.uio\.no/search-filter\?.*
>>> 
>>> # Skip statistics:
>>> https://www\.duo\.uio\.no/handle/.*/statistics$
>>> 
>>> Exclude from index:
>>> # Exclude front page - no valuable info and we have QL:
>>> https?://www\.duo\.uio\.no/$
>>> 
>>> # Do not index navigation, but follow:
>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
>>> # New URL structure:
>>> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>>> 
>>> # Exclude IDs lower than four digits, probably category listings:
>>> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
>>> # New URL structure:
>>> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
>>> 
>>> Thanks for looking at this!
>>> 
>>> BTW: Within an hour, I will be away from my computer and cannot test
>>> anymore until Monday. I'm leaving Oslo for some days, but I will still be
>>> able to read and answer emails.
>>> 
>>> Erlend
>>> 
>>> 
>>>> On 18.09.14 13:43, Karl Wright wrote:
>>>> 
>>>> Hi Erlend,
>>>> 
>>>> The "Interrupted: null" message with a -104 code means only that the
>>>> fetch
>>>> was interrupted by something.  Unfortunately, the message is not clear
>>>> about what the cause of the interruption is.  This is unrelated to
>>>> Zookeeper; but I agree that it is suspicious that many such interruptions
>>>> appear right after robots is parsed.
>>>> 
>>>> One cause of a -104 is the target server forcibly dropping the connection,
>>>> so that an InterruptedIOException is thrown. Looking at the timestamps of
>>>> the fetch messages, it seems believable that you exceeded some predetermined
>>>> limit on that machine: they're all within a few milliseconds of each other.
>>>> When a robots file needs to be read, ManifoldCF creates an event for that,
>>>> and the URLs blocked by that event all become fetchable as soon as the event
>>>> is released. Perhaps your throttling needs to be adjusted now that the rate
>>>> limit bug has been fixed?
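>>>> 
>>>> If you want to check, independently of ManifoldCF, whether that host drops
>>>> near-simultaneous connections, a crude burst test like the sketch below can
>>>> help. The target URL and burst size are placeholders, not anything from your
>>>> configuration, and it's worth clearing with the webmaster before pointing it
>>>> at a production host:
>>>> 
>>>> import java.io.IOException;
>>>> import java.net.HttpURLConnection;
>>>> import java.net.URL;
>>>> import java.util.concurrent.CountDownLatch;
>>>> import java.util.concurrent.ExecutorService;
>>>> import java.util.concurrent.Executors;
>>>> import java.util.concurrent.TimeUnit;
>>>> 
>>>> public class BurstCheck {
>>>>     public static void main(String[] args) throws InterruptedException {
>>>>         final String target = "https://www.duo.uio.no/robots.txt"; // placeholder
>>>>         final int burst = 10; // placeholder, below the configured connection limits
>>>>         ExecutorService pool = Executors.newFixedThreadPool(burst);
>>>>         CountDownLatch go = new CountDownLatch(1);
>>>>         for (int i = 0; i < burst; i++) {
>>>>             pool.execute(() -> {
>>>>                 try {
>>>>                     go.await(); // release all requests at (nearly) the same instant
>>>>                     HttpURLConnection conn =
>>>>                         (HttpURLConnection) new URL(target).openConnection();
>>>>                     conn.setRequestMethod("HEAD");
>>>>                     System.out.println(Thread.currentThread().getName()
>>>>                         + " -> HTTP " + conn.getResponseCode());
>>>>                 } catch (IOException | InterruptedException e) {
>>>>                     // A forcibly dropped connection shows up here as an IOException.
>>>>                     System.out.println(Thread.currentThread().getName() + " -> " + e);
>>>>                 }
>>>>             });
>>>>         }
>>>>         go.countDown();
>>>>         pool.shutdown();
>>>>         pool.awaitTermination(1, TimeUnit.MINUTES);
>>>>     }
>>>> }
>>>> 
>>>> If several of those fail while a single curl succeeds, the server is limiting
>>>> concurrent connections rather than blocking you outright.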
>>>> 
>>>> I won't be able to work on this without at least your crawling parameters
>>>> for the server in question. I can ping that server, so if you would like, I
>>>> can try crawling it from here.
>>>> 
>>>> For ZooKeeper, I would still either increase your tick count to maybe 10000
>>>> or, better yet, find out why you periodically lose the ability to transmit
>>>> pings from MCF to your ZooKeeper process.
>>>> 
>>>> Thanks,
>>>> Karl
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen <e.f.garasen@usit.uio.no>
>>>> wrote:
>>>> 
>>>>> On 18.09.14 13:00, Karl Wright wrote:
>>>>>
>>>>>> Hi Erlend,
>>>>>>
>>>>>> please can you also add the manifoldcf log as well?
>>>>>
>>>>> Yes, I will, but it includes entries from RC0 as well.
>>>>> 
>>>>> MCF works perfectly with the other jobs for the other hosts. Take a look
>>>>> at the following once again; MCF is being interrupted:
>>>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Interrupted: null
>>>>> 
>>>>> You can find this entry near the other entries regarding the robots.txt file:
>>>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
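>>>>> 
>>>>> Since the log also contains entries from RC0, here is a rough sketch for
>>>>> pulling out only the -104 FETCH lines. It relies solely on the
>>>>> pipe-separated layout shown above; no claims about what every field means:
>>>>> 
>>>>> import java.io.IOException;
>>>>> import java.nio.file.Files;
>>>>> import java.nio.file.Path;
>>>>> 
>>>>> public class FindInterruptedFetches {
>>>>>     public static void main(String[] args) throws IOException {
>>>>>         // Path is a placeholder; point it at the downloaded manifoldcf.log.
>>>>>         Path log = Path.of(args.length > 0 ? args[0] : "manifoldcf.log");
>>>>>         try (var lines = Files.lines(log)) {
>>>>>             lines.filter(l -> l.contains("WEB: FETCH URL|"))
>>>>>                  .filter(l -> {
>>>>>                      String[] f = l.substring(l.indexOf("FETCH URL|")).split("\\|");
>>>>>                      // f[1] = URL, f[3] = result code; other fields not interpreted here
>>>>>                      return f.length > 3 && "-104".equals(f[3]);
>>>>>                  })
>>>>>                  .forEach(System.out::println);
>>>>>         }
>>>>>     }
>>>>> }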
>>>>> 
>>>>> Erlend
>> 
