manifoldcf-user mailing list archives

From Ronny Heylen <securaqbere...@gmail.com>
Subject Re: Error: Repeated service interruptions - failure processing document: Read timed out
Date Thu, 07 Nov 2013 18:36:29 GMT
Karl,
I don't know where you live but if you come to Belgium, stop in Brussels
for a good Belgian beer ;-)
In other words, setting the socket timeout to 2000 seconds instead of 900
solved the problem.
The job indexed about 160,000 documents in 2 hours.
On the other hand, the ManifoldCF/Solr machine (everything runs in the same
Windows VM) has been allocated eight 3.6 GHz CPUs and 32 GB of memory, and
is used only for the indexing test, with no searches on Solr.
So the fact that a timeout of 900 seconds was not enough looks strange: is
it possible that some of these 160,000 documents take more than 15 minutes
to be handled by Solr?
Ronny&Frédéric
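
A note on what the timeout measures, since it bears on the question above. In the standard Java socket API, the read timeout (SO_TIMEOUT) limits how long a single blocking read may wait for data, and the trace quoted below shows the exception raised in receiveResponseHeader, that is, while the client was waiting for the start of Solr's reply. The following is only a minimal sketch of that behaviour using plain java.net sockets; the class name ReadTimeoutDemo, the method timesOut, and the toy "slow server" are made up for illustration and are not ManifoldCF or SolrJ code. It does not say why Solr stayed silent for so long.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {

    /**
     * Connects to a local server that stays silent for serverDelayMillis
     * before sending one byte, and reads with the given SO_TIMEOUT.
     * Returns true if the read timed out, false if the byte arrived in time.
     */
    public static boolean timesOut(int soTimeoutMillis, long serverDelayMillis)
            throws IOException, InterruptedException {
        try (ServerSocket server = new ServerSocket(0)) { // port 0: any free port
            server.setSoTimeout(5000); // do not wait forever for the client
            Thread slowServer = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    Thread.sleep(serverDelayMillis); // simulate slow processing
                    peer.getOutputStream().write(1);
                    peer.getOutputStream().flush();
                } catch (IOException | InterruptedException ignored) {
                    // The client may already have given up; nothing to do.
                }
            });
            slowServer.start();
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(),
                                            server.getLocalPort())) {
                client.setSoTimeout(soTimeoutMillis);
                InputStream in = client.getInputStream();
                in.read(); // blocks until data arrives or the timeout fires
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the same exception type as in the log
            } finally {
                slowServer.interrupt();
                slowServer.join();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timeout 200 ms, reply after 1500 ms: timed out = "
                + timesOut(200, 1500));
        System.out.println("timeout 2000 ms, reply after 100 ms: timed out = "
                + timesOut(2000, 100));
    }
}
```

The point of the sketch is that the timeout fires whenever the other side sends nothing for longer than the configured period, however fast the client machine is.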


On Thu, Nov 7, 2013 at 4:30 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Ronny,
>
> The failure occurs because, for some documents, the time spent
> transferring data to Solr exceeds the socket timeout you have set for the
> Solr connection.
>
> This is probably due to excessive load on the Solr instance.  My
> suggestion is to increase the socket timeout on your Solr connection to at
> least 30 minutes to see if this resolves the problem.
>
> Thanks,
> Karl
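
For reference, the arithmetic behind the suggestion can be checked with a few lines of Java. This is only a sketch and assumes the socket timeout field is expressed in seconds, as the "900 seconds" elsewhere in this thread implies; the class and method names are hypothetical.

```java
import java.time.Duration;

public class SocketTimeoutSeconds {

    /** Converts a timeout given in minutes to whole seconds. */
    public static long minutesToSeconds(long minutes) {
        return Duration.ofMinutes(minutes).getSeconds();
    }

    /** True if a timeout in seconds is at least the given number of minutes. */
    public static boolean atLeastMinutes(long timeoutSeconds, long minutes) {
        return Duration.ofSeconds(timeoutSeconds)
                .compareTo(Duration.ofMinutes(minutes)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(minutesToSeconds(30));     // 1800
        System.out.println(atLeastMinutes(900, 30));  // false: 900 s is 15 minutes
        System.out.println(atLeastMinutes(2000, 30)); // true: 2000 s is over 33 minutes
    }
}
```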
>
>
>
> On Thu, Nov 7, 2013 at 9:30 AM, Ronny Heylen <securaqbereusr@gmail.com> wrote:
>
>> Hi,
>> We have reset throttling to 10 for AD and Solr (2 for the Windows
>> repository).
>> The job indexing all pptx to the null output has run successfully (162733
>> documents).
>> The job indexing all pptx to Solr still fails; manifoldcf.log contains:
>>  WARN 2013-11-07 14:34:06,502 (Worker thread '29') - JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy.
>> jcifs.smb.SmbException: All pipe instances are busy.
>>     at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
>>     at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
>>     at jcifs.smb.SmbSession.send(SmbSession.java:238)
>>     at jcifs.smb.SmbTree.send(SmbTree.java:119)
>>     at jcifs.smb.SmbFile.send(SmbFile.java:775)
>>     at jcifs.smb.SmbFile.open0(SmbFile.java:989)
>>     at jcifs.smb.SmbFile.open(SmbFile.java:1006)
>>     at jcifs.smb.SmbFileOutputStream.<init>(SmbFileOutputStream.java:142)
>>     at jcifs.smb.TransactNamedPipeOutputStream.<init>(TransactNamedPipeOutputStream.java:32)
>>     at jcifs.smb.SmbNamedPipe.getNamedPipeOutputStream(SmbNamedPipe.java:187)
>>     at jcifs.dcerpc.DcerpcPipeHandle.doSendFragment(DcerpcPipeHandle.java:68)
>>     at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:190)
>>     at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
>>     at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:140)
>>     at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2943)
>>     at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2393)
>>     at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.describeDocumentSecurity(SharedDriveConnector.java:1045)
>>     at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getDocumentVersions(SharedDriveConnector.java:554)
>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:322)
>>  WARN 2013-11-07 14:55:45,257 (Worker thread '30') - IO exception during indexing: Read timed out
>> java.net.SocketTimeoutException: Read timed out
>>     at java.net.SocketInputStream.socketRead0(Native Method)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>>     at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>>     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>>     at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>>     at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>>     at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>>     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>>     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
>>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>>     at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:291)
>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>     at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:919)
>>  WARN 2013-11-07 14:55:45,273 (Worker thread '30') - Service interruption reported for job 1383765534700 connection 'Filesharesrv1': IO exception during indexing: Read timed out
>> ERROR 2013-11-07 14:55:45,304 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure processing document: Read timed out
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Read timed out
>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:586)
>> Caused by: java.net.SocketTimeoutException: Read timed out
>>     at java.net.SocketInputStream.socketRead0(Native Method)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>>     at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>>     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>>     at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>>     at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>>     at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>>     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>>     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
>>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>>     at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:291)
>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>     at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:919)
>>  WARN 2013-11-07 15:06:04,235 (Worker thread '9') - IO exception during indexing: Read timed out
>> java.net.SocketTimeoutException: Read timed out
>>     at java.net.SocketInputStream.socketRead0(Native Method)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>     at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>>     at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>>     at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>>     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>>     at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>>     at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>>     at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>>     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>>     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
>>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>>     at org.apache.manifoldcf.agents.output.solr.ModifiedHttpSolrServer.request(ModifiedHttpSolrServer.java:291)
>>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>     at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:919)
>>  WARN 2013-11-07 15:06:04,235 (Worker thread '9') - Service interruption reported for job 1383765534700 connection 'Filesharesrv1': IO exception during indexing: Read timed out
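
With a log this long, it can help to count the timeout entries per worker thread rather than read every stack trace. The following is a rough sketch, not a ManifoldCF tool: the class name TimeoutLogScan and the regular expression are assumptions based only on the entry format visible in the excerpt above (level, timestamp, thread name in parentheses, then the message), and the input is expected to be raw log lines without the ">>" quoting or mail line wrapping.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimeoutLogScan {

    // Matches entries such as:
    //  WARN 2013-11-07 14:55:45,257 (Worker thread '30') - IO exception during indexing: Read timed out
    // Groups: 1 = level, 2 = timestamp, 3 = thread name, 4 = message.
    private static final Pattern ENTRY = Pattern.compile(
        "^\\s*(WARN|ERROR)\\s+(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})"
        + "\\s+\\(([^)]*)\\)\\s+-\\s+(.*)$");

    /** Counts WARN/ERROR entries mentioning "Read timed out", keyed by thread name. */
    public static Map<String, Integer> timeoutsByThread(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            // Stack-trace lines and exception headers do not match ENTRY and are skipped.
            if (m.matches() && m.group(4).contains("Read timed out")) {
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            " WARN 2013-11-07 14:34:06,502 (Worker thread '29') - JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy.",
            " WARN 2013-11-07 14:55:45,257 (Worker thread '30') - IO exception during indexing: Read timed out",
            "java.net.SocketTimeoutException: Read timed out",
            "ERROR 2013-11-07 14:55:45,304 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure processing document: Read timed out",
            " WARN 2013-11-07 15:06:04,235 (Worker thread '9') - IO exception during indexing: Read timed out");
        // prints {Worker thread '30'=2, Worker thread '9'=1}
        System.out.println(timeoutsByThread(sample));
    }
}
```

To run it against a real file, the lines could be loaded with java.nio.file.Files.readAllLines and passed to timeoutsByThread.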
>>
>>
>>
>> On Wed, Nov 6, 2013 at 9:28 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Ronny,
>>>
>>> One minor thing: you should need to set throttling to 2 ONLY for the
>>> Windows repository connection, not for AD or Solr.
>>>
>>>
>>> As for how to debug this issue, first off you should be looking in the
>>> manifoldcf.log file (or the equivalent).  You should see WARN messages from
>>> the shared file connector under most conditions when there's a service
>>> interruption.  You would probably see "Read timed out" warnings if you
>>> looked there, since that is what aborted the job run, along with a stack
>>> trace.  However, that's not going to add much information to the analysis
>>> at this point.
>>>
>>> What might be valuable is to determine whether the problem is happening
>>> on the Windows side or on the Solr side.  At this point I can't tell.  You
>>> could, however, create a null output connection, and create a similar job
>>> that sends its output there, and see if it completes.  Can you do this and
>>> get back to me?
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 6, 2013 at 3:17 PM, Ronny Heylen <securaqbereusr@gmail.com> wrote:
>>>
>>>> Hi,
>>>> We use ManifoldCF 1.3 and Solr 4.4 to index a shared network drive with
>>>> several hundred thousand documents.
>>>> A single ManifoldCF job indexing the whole drive always gave some kind
>>>> of error, so to better understand where the problem lies, we made one
>>>> job to index all *.doc*, another for *.xls*, another for *.pdf ...
>>>> With help from the list (thanks!) we set the size limit to 100 MB,
>>>> and all jobs succeed (great) except the one for *.pptx.
>>>> The message is
>>>> Error: Repeated service interruptions - failure processing document:
>>>> Read timed out
>>>> We don't find any error in the logs we have searched: solr.log, ...
>>>> Based on some indications found on the Internet, we have set the
>>>> Throttling max connections setting to 2 (instead of 10) in 3 places:
>>>> output connection to Solr
>>>> authority connection to the Active Directory
>>>> repository connection to the Windows file share
>>>> But the problem stays the same.
>>>> We have tried on another machine with Solr 4.5 and ManifoldCF 1.4: same
>>>> problem.
>>>> We can run the job for all *.PDF, all *.DOC*, or all *.XLS* without
>>>> problems, but the same message always appears for *.PPTX.
>>>> The last time the job stopped with this message, it displayed (the
>>>> numbers differ between runs because the Windows drive is changing) 56311
>>>> documents, with 17466 busy and 38847 processed.
>>>> As we don't find anything in the logs (but we are probably not looking
>>>> in the right place), we don't know what to do.
>>>> Thanks for your help,
>>>> Ronny and Frédéric
>>>>
>>>
>>>
>>
>
