manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Hop count problem
Date Mon, 12 Aug 2013 11:48:06 GMT
Hi Erlend,

For the web connector, ManifoldCF should attempt to refetch all documents
every time the job is run.  So unless you were hacking in the database,
once a document becomes reachable again MCF should include it in the
crawl.  There should be no need to "force" MCF to do it again.  The only
time that's appropriate is if you removed all your Solr indexes (in which
case you click the "Reingest all documents" link on the Output Connection's
view page).
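
To illustrate why reachability matters here, the following is a minimal sketch of hop counting as a breadth-first walk from the seed, using the A->B->C chain from the earlier message quoted below. It is not ManifoldCF's actual implementation; the function names, the `links` dictionary, and the `unfetchable` set are assumptions made up for this illustration.

```python
from collections import deque


def hop_counts(links, seed, unfetchable=frozenset()):
    """Return {page: hops from seed}, found by breadth-first search.

    `links` maps a page to the pages it links to (hypothetical data).
    Pages in `unfetchable` are discovered but cannot be fetched, so the
    links they contain are never seen.
    """
    counts = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if page in unfetchable:
            continue  # fetch failed: outgoing links stay unknown
        for target in links.get(page, ()):
            if target not in counts:
                counts[target] = counts[page] + 1
                queue.append(target)
    return counts


def in_scope(links, seed, page, max_hops, unfetchable=frozenset()):
    """A page is in scope only if a path of at most max_hops reaches it."""
    counts = hop_counts(links, seed, unfetchable)
    return page in counts and counts[page] <= max_hops


# The chain from the quoted message: A -> B -> C
links = {"A": ["B"], "B": ["C"]}

print(in_scope(links, "A", "C", max_hops=6))                     # True
print(in_scope(links, "A", "C", max_hops=6, unfetchable={"B"}))  # False
```

In the second call, C has no path from the seed once B cannot be fetched, so a document left over from a previous crawl would fall outside the hop limit even though the limit itself (6) is generous.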

If you consistently see fewer documents in production, I'd do some research
to see which documents aren't getting fetched.  You could have network
connectivity issues of some kind.
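
One way to do that research is to compare the sets of URLs each environment fetched. The sketch below assumes you have collected the fetched URLs from each environment by hand (for example, copied out of the simple history report) into two plain lists; the helper name `missing_urls` and the sample data are hypothetical, and nothing here calls a ManifoldCF API.

```python
def missing_urls(test_urls, prod_urls):
    """Return the URLs fetched in test but not in production, sorted."""
    # Strip whitespace and skip blank entries so hand-copied lists compare cleanly.
    test = {u.strip() for u in test_urls if u.strip()}
    prod = {u.strip() for u in prod_urls if u.strip()}
    return sorted(test - prod)


# Sample data based on the URLs mentioned in this thread.
test_fetched = [
    "http://www.ibsen.uio.no/",
    "http://www.ibsen.uio.no/forside.xhtml",
    "http://www.ibsen.uio.no/skuespill.xhtml",
]
prod_fetched = [
    "http://www.ibsen.uio.no/",
    "http://www.ibsen.uio.no/forside.xhtml ",  # trailing space is tolerated
]

for url in missing_urls(test_fetched, prod_fetched):
    print(url)
```

Each URL this prints is a candidate to look up in the production simple history, to check whether the fetch of it, or of a page linking to it, failed.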

Karl



On Mon, Aug 12, 2013 at 7:43 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:

>
> Thanks Karl,
>
> Maybe some documents became unreachable at the time I tried to reproduce
> some problems I had with this host some months ago. But the thing is
> that our test environment crawls 50% more documents for other jobs as
> well. This might be due to unreachable documents.
>
> What is the best approach to tell MCF that all documents should be
> processed again? Manually delete some tables from the database?
>
> Erlend
>
>
> On 8/12/13 1:31 PM, Karl Wright wrote:
>
>> Hi Erlend,
>>
>> If any link in the chain from the seed to the document is broken, a
>> document reachable on a previous crawl can become unreachable and thus
>> report "Hop count exceeded".  In this case, the document must have been
>> queued somehow - or must have been present from a previous crawl.
>>
>> So, for example, suppose you have this chain:
>>
>> A->B->C
>>
>> ... and then all of a sudden, B cannot be fetched.  Then, C will report
>> that its hopcount is exceeded.
>>
>> Based on your report that the test environment works OK, and the
>> production environment does not, I expect there is something like this
>> going on.  I know you attempted to fetch the intervening document from
>> your test environment, but it is conceivable that the production
>> environment is unable to get it.  You should see evidence of that in the
>> simple history, if so.
>>
>> I can try a sample crawl from home tonight if you like, and we can see
>> whether I get the reduced set or the complete one.  However, bear in
>> mind that hopcount is one of MCF's most rigorously tested features, so I
>> personally doubt there is a problem with the hopcount logic per se.
>>
>> Thanks,
>> Karl
>>
>>
>>
>> On Mon, Aug 12, 2013 at 6:39 AM, Erlend Garåsen
>> <e.f.garasen@usit.uio.no> wrote:
>>
>>
>>     I have discovered an odd thing regarding hop counts. Our prod
>>     environment crawls a lot fewer documents compared to our test
>>     environment even though the configuration is exactly the same. Then
>>     I figured out that several documents which are expected to be
>>     fetched are, according to MCF, outside the hop count limit, but
>>     they're not.
>>
>>     This can be reproduced by using a small job for one particular host,
>>     www.ibsen.uio.no. The seed list is as follows:
>>
>>
>>     http://www.ibsen.uio.no/
>>
>>     Hop filter settings are:
>>     link: 6
>>     redirect: 3
>>
>>     Only these two documents are fetched:
>>     http://www.ibsen.uio.no/forside.xhtml
>>     http://www.ibsen.uio.no/
>>
>>     Here's what MCF says about one omitted document, i.e.,
>>     http://www.ibsen.uio.no/skuespill.xhtml:
>>     State: out of scope
>>     Status: Hopcount exceeded
>>
>>     This is odd. If you open up www.ibsen.uio.no, you can see that the
>>     link "http://www.ibsen.uio.no/skuespill.xhtml" (Skuespill) appears on
>>     the main page.
>>
>>     Our test environment fetches this document without problems.
>>
>>     Erlend
>>
>>
>>
>
