manifoldcf-dev mailing list archives

From "Donald Van den Driessche (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
Date Wed, 09 Jan 2019 13:25:00 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738231#comment-16738231 ]

Donald Van den Driessche commented on CONNECTORS-1562:
------------------------------------------------------

Karl,

After resolving the issues with creating repository connections through the API, I retested
our crawl locally, using a Docker setup that contains a ManifoldCF container and an Elasticsearch container.

I used no bandwidth throttles and a maximum connection count of 25.

With a seed map consisting of one page, our whitelist ([https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=nl&html=true]),
I still get the Stream Closed I/O exception.

Do you have any more ideas on how to keep the connection open, so that the whole whitelist
can be processed?

 

Screenshot of the Simple Report

!image-2019-01-09-14-20-50-616.png!

Stacktrace
{code:java}
ERROR 2019-01-09T13:08:37,876 (Worker thread '22') - Exception tossed: Repeated service interruptions - failure processing document: Stream Closed
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Stream Closed
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) [mcf-pull-agent.jar:?]
Caused by: java.io.IOException: Stream Closed
    at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191]
    at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191]
    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221) ~[?:?]
    at org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection$CallThread.run(ElasticSearchConnection.java:133) ~[?:?]
ERROR 2019-01-09T13:08:37,883 (Worker thread '7') - Exception tossed: Repeated service interruptions - failure processing document: Stream Closed
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Stream Closed
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) [mcf-pull-agent.jar:?]
Caused by: java.io.IOException: Stream Closed
    at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191]
    at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191]
    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221) ~[?:?]
    at org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection$CallThread.run(ElasticSearchConnection.java:133) ~[?:?]
 {code}
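
For reference, the innermost frames of the trace (InputStreamReader.read down to FileInputStream.readBytes) are what plain Java produces when a reader is used after its underlying FileInputStream has been closed. The sketch below is only an illustration of that mechanism, not ManifoldCF code: the class name StreamClosedDemo, the helper readAfterClose, and the temporary file are all hypothetical, and it makes no claim about which component closes the stream in the real crawl.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical demo class; nothing here is part of ManifoldCF.
class StreamClosedDemo {

    // Returns the message of the IOException raised when a Reader is read
    // after its underlying FileInputStream was closed, or null if no
    // exception was raised.
    static String readAfterClose() {
        File tmp = null;
        try {
            tmp = File.createTempFile("stream-closed-demo", ".txt");
            Files.write(tmp.toPath(), "document body".getBytes(StandardCharsets.UTF_8));

            FileInputStream in = new FileInputStream(tmp);
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);

            // Simulate another party closing the stream before the consumer
            // (here: the reader) has pulled any bytes from it.
            in.close();

            try {
                reader.read(new char[64], 0, 64);
                return null;
            } catch (IOException e) {
                return e.getMessage();
            }
        } catch (IOException setupFailure) {
            // Creating, writing, or closing the temp file failed.
            throw new UncheckedIOException(setupFailure);
        } finally {
            if (tmp != null) {
                tmp.delete();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterClose());
    }
}
```

If the same thing happens in the crawl, the open question is what closes the document's temporary-file stream before ElasticSearchIndex$IndexRequestEntity.writeTo has finished reading it, rather than how to keep the HTTP connection itself open.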

> Documents unreachable due to hopcount are not considered unreachable on cleanup pass
> ------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls website input and outputs content to Elasticsearch
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: Screenshot from 2018-12-31 11-17-29.png, image-2019-01-09-14-20-50-616.png, manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the job with changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
