manifoldcf-user mailing list archives

From Cheng Zeng <ze...@hotmail.co.uk>
Subject RE: Sharepoint 2013 Crawling a large list
Date Sat, 09 Apr 2016 13:44:32 GMT



Hi Karl,
Thank you very much for your reply. MCF processed all the items in the large list with no
errors once I switched to PostgreSQL. Your suggestion was very helpful.
Best regards,
Cheng

Date: Fri, 8 Apr 2016 06:05:38 -0400
Subject: Re: Sharepoint 2013 Crawling a large list
From: daddywri@gmail.com
To: user@manifoldcf.apache.org

Hi Cheng,
That is a pretty impressively messed up system!
Let's start with what we know and then go on to what we don't.
The "Remote procedure exception" error is due to an org.apache.axis.AxisFault exception that
is not apparently coming from the server.  That's pretty weird in its own right.  Equally
weird is the NPE coming from within HttpClient during NTLM processing.  Unfortunately we aren't
seeing the actual stack traces themselves, which would allow us to figure out what was happening;
instead you are getting ArrayIndexOutOfBounds and NullPointerExceptions doing basic things
like array copying (!).
Can you include one or two of the actual traces (with line numbers)?
My sense is that (a) you are using a non-standard JVM that is (b) running out of memory, but
not throwing an out of memory exception when that happens.  Rather, it's blowing up and not
allocating memory that it needs instead.  It's running out of memory most likely because (c)
you are using HSQLDB, and HSQLDB is keeping its database tables in memory, which is what it
does.
I would recommend either (1) giving MCF more memory or (2), better yet, switching to PostgreSQL.
And if this keeps happening under either scenario, please include a few of the full traces
so I can make better sense of the problem.
Please let us know what happens.
Thanks,
Karl
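
Since the diagnosis above turns on how much heap the JVM has, a quick way to see the numbers
is a minimal sketch like the one below. The class name HeapReport and its helper methods are
hypothetical and not part of ManifoldCF; the sketch reports the heap of the JVM it runs in,
so it only reflects MCF's situation if it is launched with the same JVM and the same memory
options (for example the standard -Xmx option) as the MCF process.

```java
public class HeapReport {
    // Convert a byte count to whole mebibytes (hypothetical helper).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Summarise heap figures: the maximum the JVM may grow to, and what is in use now.
    static String describe(long maxBytes, long totalBytes, long freeBytes) {
        long usedBytes = totalBytes - freeBytes;
        return "max=" + toMiB(maxBytes) + " MiB, used=" + toMiB(usedBytes) + " MiB";
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the configured heap ceiling for this JVM.
        System.out.println(describe(rt.maxMemory(), rt.totalMemory(), rt.freeMemory()));
    }
}
```

If the reported maximum is small compared with the size of the crawl, that would be consistent
with the in-memory database tables exhausting the heap.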

On Fri, Apr 8, 2016 at 3:32 AM, Cheng Zeng <zengc@hotmail.co.uk> wrote:



Hi,
I am trying to extract web pages and attachments from SharePoint 2013 and upload the data
to Solr for indexing.
I have installed the SharePoint plugin on the SharePoint 2013 server and have been able to use
ManifoldCF to fetch items from lists with fewer than 160 items. My problem is that a few
lists have more than 4,900 items. When ManifoldCF crawls these large lists, it starts to
process items very slowly and seems to stop working after 2,100 items have been processed.
I tried to slow down the rate at which items are uploaded to the Solr instance by forcing the
worker thread to sleep for 3 seconds after every 50 items added to the pipeline. I tried this
several times, but ManifoldCF still starts to process items very slowly once 2,100 items in
the list have been processed. The slowdown begins around 30 minutes after the crawl job
starts, and the following errors are logged.
WARN 2016-04-08 12:29:14,762 (Worker thread '19') - Service interruption reported for job 1460088455222 connection 'SharepointRepoistoryConn': Remote procedure exception: ; nested exception is:
	java.lang.ArrayIndexOutOfBoundsException
FATAL 2016-04-08 12:29:14,777 (Worker thread '28') - Error tossed: null
java.lang.NullPointerException
FATAL 2016-04-08 12:30:37,611 (Worker thread '29') - Error tossed: null
java.lang.NullPointerException
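
For reference, the throttling described above (pause for a fixed time after every batch of
items) can be sketched roughly as follows. This is only an illustration under assumptions:
the class BatchThrottle, the method processWithPause, and the sink callback are hypothetical
names, not ManifoldCF or Solr APIs, and the actual change made to the worker thread is not
shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchThrottle {
    // Hand each item to the sink; after every full batch, sleep before continuing.
    // Returns the number of pauses taken. No pause is taken after the final item.
    static <T> int processWithPause(List<T> items, int batchSize, long pauseMillis,
                                    Consumer<T> sink) {
        int pauses = 0;
        for (int i = 0; i < items.size(); i++) {
            sink.accept(items.get(i));
            boolean batchDone = (i + 1) % batchSize == 0;
            boolean moreToCome = i + 1 < items.size();
            if (batchDone && moreToCome) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop early.
                    Thread.currentThread().interrupt();
                    break;
                }
                pauses++;
            }
        }
        return pauses;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            items.add(i);
        }
        // The message describes 50 items and 3 seconds; a short pause is used here.
        int pauses = processWithPause(items, 50, 10L, item -> { });
        System.out.println("pauses taken: " + pauses);
    }
}
```

A pause like this only lowers the rate at which items reach the output; it does not reduce
how much memory the crawler or its database holds.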

The log is attached. If someone could help me, I would really appreciate it.
Best regards,
Cheng