manifoldcf-user mailing list archives

From: Karl Wright
Subject: RE: Error in Manifoldcf, what's the first step?
Date: Tue, 29 Oct 2013 11:12:24 GMT
Based on the error message, Adrian is correct and this is once again a
Solr-side problem.  Since Solr loads each document into memory, my guess is
that you are attempting to index some very large documents and those are
causing Solr to run out of memory.  Either exclude these documents from the
crawl or set a reasonable maximum document length.
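
To find candidates for exclusion, something like the following could list
the largest matching files on the share.  This is only a rough sketch, not
part of ManifoldCF or Solr: the function name find_large_files, the mount
path and the 50 MB threshold are all assumptions made for illustration, and
the patterns simply mirror the job's filters.

```python
import fnmatch
import os


def find_large_files(root, max_bytes, patterns=("*.doc*", "*.xls*", "*.pdf")):
    """Return (path, size) pairs for matching files larger than max_bytes,
    largest first. Patterns mirror the crawl job's file filters."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lower = name.lower()
            # Compare case-insensitively, as a Windows share would.
            if not any(fnmatch.fnmatchcase(lower, p) for p in patterns):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Unreadable or vanished file: skip it rather than abort.
                continue
            if size > max_bytes:
                hits.append((path, size))
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits


if __name__ == "__main__":
    # Hypothetical mount point and threshold; adjust both to your setup.
    for path, size in find_large_files("/mnt/shared_drive", 50 * 1024 * 1024):
        print(f"{size / (1024 * 1024):8.1f} MB  {path}")
```

If the jobs that fail turn out to be the ones whose subfolders contain the
files at the top of this list, that would support the out-of-memory theory.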


Sent from my Windows Phone
From: Ronny Heylen
Sent: 10/29/2013 6:52 AM
Subject: Error in Manifoldcf, what's the first step?


Solr is 4.4, manifoldcf 1.3.

We are indexing a shared Windows network drive, filtering on *.doc*,
*.xls*, *.pdf ... with about 650,000 files to index, giving a Solr index
of 35 GB.

The result is great except that the manifoldcf job crashes before the end.

Note that:
- ignoreTikaException is true in solrconfig.xml (otherwise the manifoldcf
job stops very early).
- tomcat has been given 24 GB of memory (it uses 15GB)
- there are 8 cores

Message in http://localhost:8080/mcf-crawler-ui/showjobstatus.jsp is:
Error: Repeated service interruptions - failure processing document: Server
at http://localhost:8080/solr/collection1 returned non ok status:500,
message:Internal Server Error

Then, instead of indexing the full drive in one job, we have defined one
job for each subfolder.

Almost all "subfolder" jobs end successfully. Only for 2 or 3 do we receive
the same message, and for 2 or 3 others a different message:

Error: Repeated service interruptions - failure processing document: Read
timed out

If we try to go further (defining one job for each subfolder of a subfolder
in error), the same happens: success for almost all subfolders except 1 or

What is the first step to take to solve this problem?

