manifoldcf-user mailing list archives

From Adrian Conlon
Subject RE: Error in Manifoldcf, what's the first step?
Date Tue, 29 Oct 2013 11:00:25 GMT

It doesn't look like the problem is on the ManifoldCF crawling side; it looks more like it is on the Solr indexing side.

What, if anything, do the Solr logs say about the problem?
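
As a starting point for reading those logs, here is a minimal sketch that counts the exception classes appearing on error lines. It assumes the Solr log is a plain-text file (for example Tomcat's catalina.out, which is an assumption about this setup) and that error lines contain "ERROR", "SEVERE" or "Caused by:". The function name summarize_errors is hypothetical.

```python
import re
import sys
from collections import Counter

# Matches fully qualified Java exception/error class names,
# e.g. org.apache.tika.exception.TikaException
EXC_PATTERN = re.compile(r"\b((?:[a-z_]\w*\.)+[A-Z]\w*(?:Exception|Error))\b")


def summarize_errors(lines):
    """Count exception class names found on error-level or 'Caused by:' lines."""
    counts = Counter()
    for line in lines:
        is_error = "ERROR" in line or "SEVERE" in line
        is_cause = line.lstrip().startswith("Caused by:")
        if is_error or is_cause:
            for name in EXC_PATTERN.findall(line):
                counts[name] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (the log path is whatever applies to your install):
    #   python summarize_errors.py /path/to/solr.log
    with open(sys.argv[1], errors="replace") as handle:
        for name, count in summarize_errors(handle).most_common():
            print(count, name)
```

The most frequent exception class is usually a reasonable hint about whether the 500 responses come from document parsing, memory pressure, or something else.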


From: Ronny Heylen
Sent: 29 October 2013 10:52
Subject: Error in Manifoldcf, what's the first step?


Solr is 4.4, ManifoldCF is 1.3.

We are indexing a shared windows network drive, filtering on *.doc*, *.xls*, *.pdf ... with
about 650,000 files to index, giving a SOLR index 35GB in size.

The result is great except that the manifoldcf job crashes before the end.

Note that:
- ignoreTikaException is true in solrconfig.xml (otherwise the manifoldcf job stops very early).
- tomcat has been given 24 GB of memory (it uses 15GB)
- there are 8 cores

Message in http://localhost:8080/mcf-crawler-ui/showjobstatus.jsp is:
Error: Repeated service interruptions - failure processing document: Server at http://localhost:8080/solr/collection1
returned non ok status:500, message:Internal Server Error

Then, instead of indexing the full drive in one job, we have defined one job for each subfolder.
Almost all "subfolder" jobs end successfully. For only 2 or 3 of them we receive the same message,
and for 2 or 3 others a different message:

Error: Repeated service interruptions - failure processing document: Read timed out

If we try to go further (defining one job for each subfolder of a failing subfolder), the
same happens: success for almost all subfolders except 1 or 2.

What is the first step to take to solve this problem?
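
The subfolder-by-subfolder narrowing described above can be sketched as a recursive halving search. This is only an illustration: find_failing and batch_fails are hypothetical names, batch_fails stands in for "run an indexing job over these paths and report whether it errored", and the approach assumes failures are reproducible and tied to particular documents rather than to overall load (which may not hold for the read timeouts).

```python
def find_failing(items, batch_fails):
    """Return the items that fail on their own, by halving every failing batch.

    items       -- list of folder or file paths (any hashable values)
    batch_fails -- hypothetical callback: takes a list of items and returns
                   True if indexing that batch produced an error
    """
    if not items or not batch_fails(items):
        return []            # the whole batch succeeded, nothing to report
    if len(items) == 1:
        return list(items)   # a single failing item: this is a culprit
    mid = len(items) // 2
    return (find_failing(items[:mid], batch_fails)
            + find_failing(items[mid:], batch_fails))


# Example with a stand-in failure check instead of a real indexing job.
bad = {"c.doc", "f.pdf"}
files = ["a.doc", "b.xls", "c.doc", "d.pdf", "e.doc", "f.pdf", "g.xls", "h.doc"]
culprits = find_failing(files, lambda batch: any(f in bad for f in batch))
print(culprits)
```

With a handful of culprit documents identified this way, they can be checked individually against the Solr logs.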
