manifoldcf-dev mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: [VOTE] Release Apache ManifoldCF 1.7.1, RC1
Date Mon, 22 Sep 2014 10:36:31 GMT

Even though Zookeeper is running on the same machine?

I'm planning to investigate this issue further by using tcpdump. I have 
already turned on DEBUG logging, but nothing suspicious is showing up in 
my logs.
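A quick way to quantify what Karl suspects is to script the long-running ping and pull out the loss figure. A minimal sketch (the host name is a placeholder for the Zookeeper machine; adjust the count to taste):

```python
import re
import subprocess

def packet_loss(summary: str) -> float:
    """Extract the loss percentage from ping's summary output, e.g.
    '10000 packets transmitted, 9980 received, 0.2% packet loss, time 100s'."""
    m = re.search(r"([\d.]+)% packet loss", summary)
    if m is None:
        raise ValueError("no packet-loss figure found in ping output")
    return float(m.group(1))

def run_ping(host: str, count: int = 10000) -> float:
    """Run a long ping against the given host and return the loss percentage."""
    out = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True, check=False,
    ).stdout
    return packet_loss(out)

# Usage (placeholder host name):
#   print(run_ping("zookeeper.example.org"))
```

Running this in parallel with a crawl should show whether the 20-second stalls line up with periods of loss.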

This machine is on a very strict network, and that may cause these 
problems, but it's strange that all the other jobs are working perfectly.

Erlend

On 22.09.14 12:26, Karl Wright wrote:
> Hi Erlend,
>
> What I think you might want to look for, network-wise, are periods of
> significant packet loss.  Normally your server seems to have no trouble
> talking to either zookeeper or the external network, but periodically, it
> seems to lose that ability for times of at least 20 seconds.  It could be
> bad hardware, it could be routing, hard to tell.
>
> What I'd suggest to prove this is to set up a long-running "ping", e.g.
> ping -n 10000, from that machine to the server that zookeeper is running
> on, and then do a crawl.  I will wager, well, quite a lot of money, that
> you will see periods of packet loss. ;-)
>
> Karl
>
>
> On Mon, Sep 22, 2014 at 5:05 AM, Erlend Garåsen <e.f.garasen@usit.uio.no>
> wrote:
>
>>
>> I'm able to fetch documents from www.duo.uio.no using file-based
>> synchronization, so there are no network problems.
>>
>> Anyway, I'll continue to test RC2. Even though I'm not able to use
>> Zookeeper-based synchronization on that host, I may find other
>> bugs/problems.
>>
>> Erlend
>>
>>
>> On 22.09.14 10:39, Erlend Garåsen wrote:
>>
>>>
>>> I can verify a possible network problem by using file-based
>>> synchronization instead.
>>>
>>> I'll do that right away and test RC2 as well, even though you already
>>> have three +1's.
>>>
>>> The three other jobs I started before I left my office on Thursday did
>>> all complete successfully.
>>>
>>> Erlend
>>>
>>> On 19.09.14 12:27, Karl Wright wrote:
>>>
>>>> Well, it's crawled fine over night, with no issues whatsoever.  I'm
>>>> using a
>>>> Zookeeper setup, with MCF 1.7.1 RC1.
>>>>
>>>> I still maintain you've got something broken with the network in your
>>>> production machine.
>>>>
>>>> Karl
>>>>
>>>> On Thu, Sep 18, 2014 at 5:31 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>   Well, FWIW it is still crawling perfectly.  I'll let it run until done.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Thu, Sep 18, 2014 at 5:29 PM, Erlend Fedt Garåsen <
>>>>> e.f.garasen@usit.uio.no> wrote:
>>>>>
>>>>>>   I know. I spent a lot of time creating the rules, which seem to
>>>>>> index what we really want. Your observation is correct: crawling
>>>>>> DSpace repositories is very difficult; there are a lot of nonsense
>>>>>> pages we need to filter out.
>>>>>>
>>>>>> We have crawled this host for the last two years using file-based synch.
>>>>>>
>>>>>> I'm planning a new approach, i.e. using a connector etc.
>>>>>>
>>>>>> E
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>>   On 18. sep. 2014, at 22:35, "Karl Wright" <daddywri@gmail.com> wrote:
>>>>>>>
>>>>>>> Ok, I started this crawl.  It fetched and processed robots.txt
>>>>>>> perfectly.
>>>>>>>
>>>>>>> And then I saw the following: lots of fetches of fairly good-sized
>>>>>>> documents, with very few ingestions.  The documents that did not
>>>>>>> ingest look like this:
>>>>>>>
>>>>>>>
>>>>>>>   https://www.duo.uio.no/handle/10852/163/discover?order=DESC&
>>>>>>> r...pp=100&sort_by=dc.date.issued_dt
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I think your index inclusion rules may be excluding most of the
>>>>>>> content.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>   On Thu, Sep 18, 2014 at 8:48 AM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks -- I will probably not be able to get to this further until
>>>>>>>> tonight anyhow.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Thu, Sep 18, 2014 at 8:16 AM, Erlend Garåsen <
>>>>>>>> e.f.garasen@usit.uio.no> wrote:
>>>>>>>>
>>>>>>>>> I tried to fetch documents by using curl from our prod server, just
>>>>>>>>> in case a webmaster had blocked access. No problem. Maybe I should
>>>>>>>>> ask the webmaster of that host anyway, just to be sure.
>>>>>>>>>
>>>>>>>>> The interrupted message may have been caused by an abort of that
>>>>>>>>> job.
>>>>>>>>>
>>>>>>>>> I think I should just stop the problematic job and start the other
>>>>>>>>> three remaining jobs instead. I bet they will all complete. Ideally
>>>>>>>>> we shouldn't crawl www.duo.uio.no at all since it's a DSpace
>>>>>>>>> resource. I have just contacted someone who is indexing DSpace
>>>>>>>>> resources. I guess a DSpace connector is a better approach.
>>>>>>>>>
>>>>>>>>> Below you'll find some parameters.
>>>>>>>>>
>>>>>>>>> REPOSITORY CONNECTION
>>>>>>>>> ---------------------
>>>>>>>>> Throttling -> max connections: 30
>>>>>>>>> Throttling -> Max fetches/min: 100
>>>>>>>>> Bandwidth -> max connections: 25
>>>>>>>>> Bandwidth -> max kbytes/sec: 8000
>>>>>>>>> Bandwidth -> max fetches/min: 20
>>>>>>>>>
>>>>>>>>> JOB SETTINGS
>>>>>>>>> ------------
>>>>>>>>>
>>>>>>>>> Hop filters: Keep forever
>>>>>>>>>
>>>>>>>>> Seeds: https://www.duo.uio.no/
>>>>>>>>>
>>>>>>>>> Exclude from crawl:
>>>>>>>>> # Exclude some file types:
>>>>>>>>> \.gif$
>>>>>>>>> \.GIF$
>>>>>>>>> \.jpeg$
>>>>>>>>> \.JPEG$
>>>>>>>>> \.jpg$
>>>>>>>>> \.JPG$
>>>>>>>>> \.png$
>>>>>>>>> \.PNG$
>>>>>>>>> \.mpg$
>>>>>>>>> \.MPG$
>>>>>>>>> \.mpeg$
>>>>>>>>> \.MPEG$
>>>>>>>>> \.exe$
>>>>>>>>> \.bmp$
>>>>>>>>> \.BMP$
>>>>>>>>> \.mov$
>>>>>>>>> \.MOV$
>>>>>>>>> \.wmf$
>>>>>>>>> \.css$
>>>>>>>>> \.ico$
>>>>>>>>> \.ICO$
>>>>>>>>> \.mp2$
>>>>>>>>> \.mp3$
>>>>>>>>> \.mp4$
>>>>>>>>> \.wmv$
>>>>>>>>> \.tif$
>>>>>>>>> \.tiff$
>>>>>>>>> \.avi$
>>>>>>>>> \.ogg$
>>>>>>>>> \.ogv$
>>>>>>>>> \.zip$
>>>>>>>>> \.gz$
>>>>>>>>> \.psd$
>>>>>>>>>
>>>>>>>>> # TIKA-1011
>>>>>>>>> \.mhtml$
>>>>>>>>>
>>>>>>>>> # Exclude log files:
>>>>>>>>> \.log$
>>>>>>>>> \.logfile$
>>>>>>>>>
>>>>>>>>> # In general, do not allow indexing of DUO search results:
>>>>>>>>> https?://www\.duo\.uio\.no/sok/search.*
>>>>>>>>>
>>>>>>>>> # Other DUO elements to be excluded:
>>>>>>>>> https://www\.duo\.uio\.no.*open-search/description\.xml$
>>>>>>>>> https://www\.duo\.uio\.no/(inn|login|feed|search|advanced-search|community-list|browse|password-login|inn|discover).*
>>>>>>>>>
>>>>>>>>> # Skip locale settings - makes duplicates:
>>>>>>>>> https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$
>>>>>>>>>
>>>>>>>>> # Temporarily skip PDFs since we are indexing abstracts:
>>>>>>>>> https://www\.duo\.uio\.no/bitstream/handle/.+
>>>>>>>>>
>>>>>>>>> # skip full item record:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+\?show=full$
>>>>>>>>> # new URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/handle/.*\?show=full$
>>>>>>>>>
>>>>>>>>> # Skip all navigations but "start with letter":
>>>>>>>>> https://www\.duo\.uio\.no/.*type=(author|dateissued)$
>>>>>>>>>
>>>>>>>>> # Skip search:
>>>>>>>>> #https://www\.duo\.uio\.no/handle/.*/discover\?.*
>>>>>>>>> https://www\.duo\.uio\.no/handle/.*search-filter\?.*
>>>>>>>>> # new URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/discover\?.*
>>>>>>>>> https://www\.duo\.uio\.no/search-filter\?.*
>>>>>>>>>
>>>>>>>>> # Skip statistics:
>>>>>>>>> https://www\.duo\.uio\.no/handle/.*/statistics$
>>>>>>>>>
>>>>>>>>> Exclude from index:
>>>>>>>>> # Exclude front page - no valuable info and we have QL:
>>>>>>>>> https?://www\.duo\.uio\.no/$
>>>>>>>>>
>>>>>>>>> # Do not index navigation, but follow:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d+/.+
>>>>>>>>> # new URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d+/.+
>>>>>>>>>
>>>>>>>>> # Exclude IDs of four digits or fewer, probably category listings:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d{9}/\d{1,4}$
>>>>>>>>> # new URL structure:
>>>>>>>>> https://www\.duo\.uio\.no/handle/\d+/\d{1,3}$
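As a sanity check, exclusion patterns like the ones above can be exercised against sample URLs outside MCF. A minimal sketch, with the caveats that it uses Python's `re` engine rather than the Java regex engine ManifoldCF actually applies, assumes match-anywhere (search) semantics, and the sample URLs are made up:

```python
import re

# A small subset of the exclusion rules listed above.  Python's re engine is
# close to Java's for patterns this simple, but treat this only as a rough check.
EXCLUDE = [
    r"\.gif$",
    r"https?://www\.duo\.uio\.no/sok/search.*",
    r"https://www\.duo\.uio\.no/.*\?locale-attribute=\w{2}$",
    r"https://www\.duo\.uio\.no/handle/.*\?show=full$",
    r"https://www\.duo\.uio\.no/handle/.*/statistics$",
]
PATTERNS = [re.compile(p) for p in EXCLUDE]

def excluded(url: str) -> bool:
    """Return True if any exclusion pattern matches somewhere in the URL."""
    return any(p.search(url) for p in PATTERNS)

# Example (made-up item URL):
#   excluded("https://www.duo.uio.no/handle/10852/163?show=full")  -> True
```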
>>>>>>>>>
>>>>>>>>> Thanks for looking at this!
>>>>>>>>>
>>>>>>>>> BTW: Within an hour, I will be away from my computer and cannot
>>>>>>>>> test any more until Monday. I'm leaving Oslo for some days, but I
>>>>>>>>> will still be able to read and answer emails.
>>>>>>>>>
>>>>>>>>> Erlend
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>   On 18.09.14 13:43, Karl Wright wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Erlend,
>>>>>>>>>>
>>>>>>>>>> The "Interrupted: null" message with a -104 code means only that
>>>>>>>>>> the fetch was interrupted by something.  Unfortunately, the message
>>>>>>>>>> is not clear about what the cause of the interruption is.  This is
>>>>>>>>>> unrelated to Zookeeper; but I agree that it is suspicious that many
>>>>>>>>>> such interruptions appear right after robots.txt is parsed.
>>>>>>>>>>
>>>>>>>>>> One cause of a -104 is when the target server forcibly drops the
>>>>>>>>>> connection, so an InterruptedIOException is thrown.  Having a look
>>>>>>>>>> at the timestamps for the fetch messages, it looks believable that
>>>>>>>>>> you might have exceeded some predetermined limit on that machine.
>>>>>>>>>> They're all within a few milliseconds of each other.  When a robots
>>>>>>>>>> file needs to be read, ManifoldCF creates an event for that, and
>>>>>>>>>> the URLs blocked by that event will all be 'fetchable' as soon as
>>>>>>>>>> the event is released.  Perhaps your throttling needs to be
>>>>>>>>>> adjusted now that the rate limit bug has been fixed?
>>>>>>>>>>
>>>>>>>>>> I won't be able to work with this without at least your crawling
>>>>>>>>>> parameters for the server in question.  I can ping that server, so
>>>>>>>>>> if you would like, I can try crawling that server from here.
>>>>>>>>>>
>>>>>>>>>> For Zookeeper, I would still try to either increase your tick
>>>>>>>>>> count to maybe 10000 or, better yet, find out why you periodically
>>>>>>>>>> lose the ability to transmit pings from MCF to your Zookeeper
>>>>>>>>>> process.
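For reference, the tick count Karl refers to is the `tickTime` setting in Zookeeper's `zoo.cfg` (milliseconds per tick; the default is 2000). Session timeouts are bounded in multiples of ticks, so raising it makes Zookeeper more tolerant of missed pings. An illustrative fragment only, not a tuned recommendation:

```
# zoo.cfg -- illustrative only; session timeout bounds scale with tickTime
tickTime=10000
```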
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 18, 2014 at 7:15 AM, Erlend Garåsen <
>>>>>>>>>> e.f.garasen@usit.uio.no> wrote:
>>>>>>>>>>
>>>>>>>>>>   On 18.09.14 13:00, Karl Wright wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Erlend,
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> please can you also add the manifoldcf log as well?
>>>>>>>>>>>>
>>>>>>>>>>> Yes, I will, but it includes entries from RC0 as well.
>>>>>>>>>>>
>>>>>>>>>>> MCF works perfectly using the other jobs for the other hosts.
>>>>>>>>>>> Take a look at the following once again. MCF is being interrupted:
>>>>>>>>>>> INFO 2014-09-18 11:13:42,824 (Worker thread '19') - WEB: FETCH
>>>>>>>>>>> URL|https://www.duo.uio.no/|1411030940209+682605|-104|4096|
>>>>>>>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException|
>>>>>>>>>>> Interrupted: Interrupted: null
>>>>>>>>>>>
>>>>>>>>>>> You can find this entry near the other one regarding the
>>>>>>>>>>> robots.txt file:
>>>>>>>>>>> http://folk.uio.no/erlendfg/manifoldcf/manifoldcf.log
>>>>>>>>>>>
>>>>>>>>>>> Erlend
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

