manifoldcf-dev mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
Date: Thu, 18 Sep 2014 21:31:40 GMT
Well, FWIW it is still crawling perfectly.  I'll let it run until done.

Karl


On Thu, Sep 18, 2014 at 5:29 PM, Erlend Fedt Garåsen <e.f.garasen@usit.uio.no> wrote:

> I know. I spent a lot of time creating the rules, which seem to index what
> we really want. Your observation is correct. Crawling Dspace repositories
> is very difficult. There are a lot of nonsense pages we need to filter out.
>
> We have crawled this host for the last two years using file-based sync.
>
> I'm planning a new approach, i.e. using a connector instead.
>
> E
>
> Sent from my iPhone
>
> > On 18. sep. 2014, at 22:35, "Karl Wright" <daddywri@gmail.com> wrote:
> >
> > Ok, I started this crawl.  It fetched and processed robots.txt perfectly.
> > And then I saw the following: lots of fetches of fairly good-sized
> > documents, with very few ingestions.  The documents that did not ingest
> > look like this:
> >
> >
> > https://www.duo.uio.no/handle/10852/163/discover?order=DESC&r...pp=100&sort_by=dc.date.issued_dt
> >
> >
> > I think your index inclusion rules may be excluding most of the content.
> >
> > Karl
> >
> >
> >
> >> On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> Thanks -- I will probably not be able to get to this further until
> >> tonight anyhow.
> >>
> >> Karl
> >>
> >> On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen
> >> <e.f.garasen@usit.uio.no> wrote:
> >>
> >>>
> >>> I tried to fetch documents using curl from our prod server, just in
> >>> case a webmaster had blocked access. No problem. Maybe I should ask the
> >>> webmaster of that host anyway, just to be sure.
> >>>
> >>> The interrupted message may have been caused by an abort of that job.
> >>>
> >>> I think I should just stop the problematic job and start the three
> >>> remaining jobs instead. I bet they will all complete. Ideally we
> >>> shouldn't crawl www.duo.uio.no at all since it's a Dspace resource. I
> >>> have just contacted someone who is indexing Dspace resources. I guess a
> >>> Dspace connector is a better approach.
> >>>
> >>> Below you'll find some parameters.
> >>>
> >>> REPOSITORY CONNECTION
> >>> ---------------------
> >>> Throttling -> max connections: 30
> >>> Throttling -> max fetches/min: 100
> >>> Bandwidth -> max connections: 25
> >>> Bandwidth -> max kbytes/sec: 8000
> >>> Bandwidth -> max fetches/min: 20
> >>>
> >>> JOB SETTINGS
> >>> ------------
> >>>
> >>> Hop filters: Keep forever
> >>>
> >>> Seeds: https://www.duo.uio.no/
> >>>
> >>> Exclude from crawl:
> >>> # Exclude some file types:
> >>> \.gif$
> >>> \.GIF$
> >>> \.jpeg$
> >>> \.JPEG$
> >>> \.jpg$
> >>> \.JPG$
> >>> \.png$
> >>> \.PNG$
> >>> \.mpg$
> >>> \.MPG$
> >>> \.mpeg$
> >>> \.MPEG$
> >>> \.exe$
> >>> \.bmp$
> >>> \.BMP$
> >>> \.mov$
> >>> \.MOV$
> >>> \.wmf$
> >>> \.css$
> >>> \.ico$
> >>> \.ICO$
> >>> \.mp2$
> >>> \.mp3$
> >>> \.mp4$
> >>> \.wmv$
> >>> \.tif$
> >>> \.tiff$
> >>> \.avi$
> >>> \.ogg$
> >>> \.ogv$
> >>> \.zip$
> >>> \.gz$
> >>> \.psd$
> >>>
> >>> # TIKA-1011
> >>> \.mhtml$
> >>>
> >>> # Exclude log files:
> >>> \.log$
> >>> \.logfile$
> >>>
> >>> # In general, do not allow indexing of DUO search results:
> >>> https?://www\.duo\.uio\.no/sok/search.*
> >>>
> >>> # Other elements in DUO that should be excluded:
> >>> https://www\.duo\.uio\.no.*open-search/description\.xml$
> >>> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
> >>>
> >>> # Skip locale settings - makes duplicates:
> >>> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
> >>>
> >>> # Temporarily skip PDFs since we are indexing abstracts:
> >>> https://www\.duo\.uio\.no/bitstream/handle/.+
> >>>
> >>> # Skip full item record:
> >>> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
> >>> # New URL structure:
> >>> https://www\.duo\.uio\.no/handle/.*\?show=full$
> >>>
> >>> # Skip all navigations but "start with letter":
> >>> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
> >>>
> >>> # Skip search:
> >>> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
> >>> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
> >>> # New URL structure:
> >>> https://www\.duo\.uio\.no/discover\?.*
> >>> https://www\.duo\.uio\.no/search-filter\?.*
> >>>
> >>> # Skip statistics:
> >>> https://www\.duo\.uio\.no/handle/.*/statistics$
> >>>
> >>> Exclude from index:
> >>> # Exclude front page - no valuable info and we have QL:
> >>> https?://www\.duo\.uio\.no/$
> >>>
> >>> # Do not index navigation, but follow:
> >>> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
> >>> # New URL structure:
> >>> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
> >>>
> >>> # Exclude short IDs, probably category listings:
> >>> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
> >>> # New URL structure:
> >>> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
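> >>>
> >>> A quick way to sanity-check why URLs like the discover ones mentioned
> >>> earlier in the thread get fetched but never ingested -- a minimal
> >>> sketch using plain java.util.regex (assuming anchored matching; this
> >>> is not ManifoldCF's own rule-matching code, and the sample URL is just
> >>> an example):
> >>>
> >>> import java.util.regex.Pattern;
> >>>
> >>> public class RuleCheck {
> >>>     public static void main(String[] args) {
> >>>         // One crawl-exclusion and one index-exclusion pattern from above.
> >>>         Pattern crawlExcl = Pattern.compile(
> >>>                 "https://www\\.duo\\.uio\\.no/discover\\?.*");
> >>>         Pattern indexExcl = Pattern.compile(
> >>>                 "https://www\\.duo\\.uio\\.no/handle/\\d+/\\d+/.+");
> >>>
> >>>         // A handle/.../discover URL of the kind seen in the crawl:
> >>>         String url = "https://www.duo.uio.no/handle/10852/163/"
> >>>                 + "discover?sort_by=dc.date.issued_dt";
> >>>
> >>>         // false: the /discover rule only matches directly after the
> >>>         // host, so this URL is still crawled...
> >>>         System.out.println(crawlExcl.matcher(url).matches());
> >>>         // ...but true: the handle/\d+/\d+/.+ rule keeps it out of the
> >>>         // index, which would explain fetches without ingestions.
> >>>         System.out.println(indexExcl.matcher(url).matches());
> >>>     }
> >>> }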
> >>>
> >>> Thanks for looking at this!
> >>>
> >>> BTW: Within an hour, I will be away from my computer and cannot test
> >>> anymore until Monday. I'm leaving Oslo for some days, but I will still
> >>> be able to read and answer emails.
> >>>
> >>> Erlend
> >>>
> >>>
> >>>> On 18.09.14 13:43, Karl Wright wrote:
> >>>>
> >>>> Hi Erlend,
> >>>>
> >>>> The "Interrupted: null" message with a -104 code means only that the
> >>>> fetch
> >>>> was interrupted by something.  Unfortunately, the message is not clear
> >>>> about what the cause of the interruption is.  This is unrelated to
> >>>> Zookeeper; but I agree that it is suspicious that many such
> interruptions
> >>>> appear right after robots is parsed.
> >>>>
> >>>> One cause of a -104 is when the target server forcibly drops the
> >>>> connection, so an InterruptedIOException is thrown.  Having a look at
> >>>> the timestamps for the fetch messages, it looks believable that you
> >>>> might have exceeded some predetermined limit on that machine.  They're
> >>>> all within a few milliseconds of each other.  When a robots file needs
> >>>> to be read, ManifoldCF creates an event for that, and the URLs blocked
> >>>> by that event will all be 'fetchable' as soon as the event is released.
> >>>> Perhaps your throttling needs to be adjusted now that the rate-limit
> >>>> bug has been fixed?
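> >>>>
> >>>> If you want to test that hypothesis from outside MCF, here is a
> >>>> minimal standalone probe (hypothetical code, not from the MCF
> >>>> codebase) that fires an unthrottled burst at the host, roughly what
> >>>> happens when a robots event is released:
> >>>>
> >>>> import java.io.IOException;
> >>>> import java.io.InputStream;
> >>>> import java.net.HttpURLConnection;
> >>>> import java.net.URL;
> >>>>
> >>>> public class BurstProbe {
> >>>>     public static void main(String[] args) throws Exception {
> >>>>         for (int i = 0; i < 30; i++) {
> >>>>             try {
> >>>>                 HttpURLConnection conn = (HttpURLConnection)
> >>>>                         new URL("https://www.duo.uio.no/").openConnection();
> >>>>                 int code = conn.getResponseCode();
> >>>>                 try (InputStream in = conn.getInputStream()) {
> >>>>                     while (in.read() != -1) { /* drain the body */ }
> >>>>                 }
> >>>>                 System.out.println(i + ": HTTP " + code);
> >>>>             } catch (IOException e) {
> >>>>                 // A forcibly dropped connection surfaces here (e.g.
> >>>>                 // "Connection reset"), the same condition the web
> >>>>                 // connector reports as a -104 fetch result.
> >>>>                 System.out.println(i + ": dropped - " + e);
> >>>>             }
> >>>>         }
> >>>>     }
> >>>> }
> >>>>
> >>>> If the first handful succeed and the rest start failing, the server is
> >>>> enforcing its own rate limit.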
> >>>>
> >>>> I won't be able to work with this without at least your crawling
> >>>> parameters for the server in question.  I can ping that server, so if
> >>>> you would like I can try crawling that server from here.
> >>>>
> >>>> For zookeeper, I would still try to either increase your tick count to
> >>>> maybe 10000, or better yet, find out why you periodically lose the
> >>>> ability to transmit pings from MCF to your zookeeper process.
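> >>>>
> >>>> (For reference, the setting in question is tickTime in zoo.cfg; a
> >>>> sketch of the change, assuming you are on the stock 2000 ms default --
> >>>> session timeout bounds are multiples of the tick, so raising it buys
> >>>> MCF more slack between pings:)
> >>>>
> >>>> # zoo.cfg
> >>>> tickTime=10000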
> >>>>
> >>>> Thanks,
> >>>> Karl
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen
> >>>> <e.f.garasen@usit.uio.no> wrote:
> >>>>
> >>>>> On 18.09.14 13:00, Karl Wright wrote:
> >>>>>
> >>>>>> Hi Erlend,
> >>>>>>
> >>>>>> Please can you add the manifoldcf log as well?
> >>>>>
> >>>>> Yes, I will, but it includes entries from RC0 as well.
> >>>>>
> >>>>> MCF works perfectly using the other jobs for the other hosts.  Take a
> >>>>> look at the following once again.  MCF is being interrupted:
> >>>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Interrupted: null
> >>>>>
> >>>>> You can find this entry near the other one regarding the robots.txt
> >>>>> file:
> >>>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
> >>>>>
> >>>>> Erlend
> >>
>
