manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <>
Subject [jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
Date Mon, 31 Dec 2018 14:40:00 GMT


Karl Wright commented on CONNECTORS-1562:

Yes, that's the error.  Specifically:

Caused by: Stream Closed
        at Method) ~[?:1.8.0_191]
        at ~[?:1.8.0_191]
        at sun.nio.cs.StreamDecoder.readBytes( ~[?:1.8.0_191]
        at sun.nio.cs.StreamDecoder.implRead( ~[?:1.8.0_191]
        at ~[?:1.8.0_191]
        at ~[?:1.8.0_191]
        at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(

What's happening is that a document is being streamed to ElasticSearch.  The input stream
for the document is being read to do that.  But the stream is being closed early by the web
connector for some reason before it's entirely read.  It's not clear why; it could be a difference
between the size reported by the content type and the actual number of bytes being read, or
it could be the actual web service closing the stream early at some point.

At any rate, it is *one* specific document doing this.  If you can figure out which document
it is, I may be able to come up with a solution.  Is it a very large document?  When you try
to fetch it using (say) curl, does the fetch complete?
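
One way to test the size-mismatch hypothesis outside of ManifoldCF is to fetch the suspect
URL and compare the declared Content-Length with the number of bytes that actually arrive.
The following is only a minimal sketch of that check, not connector code: the class name
StreamLengthCheck and its two helper methods are hypothetical, and it assumes the document
can be fetched with a plain unauthenticated GET.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical diagnostic helper; not part of ManifoldCF.
public class StreamLengthCheck {

    /** Reads the stream to the end and returns the number of bytes delivered. */
    public static long countBytes(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    /** Compares the declared length (negative if unknown) with what was read. */
    public static String compare(long declaredLength, InputStream in) {
        try {
            long actual = countBytes(in);
            if (declaredLength < 0) {
                return "UNKNOWN: no declared length, read " + actual + " bytes";
            }
            if (actual == declaredLength) {
                return "OK: read " + actual + " bytes as declared";
            }
            return "MISMATCH: declared " + declaredLength + " bytes, read " + actual;
        } catch (IOException e) {
            // The stream ended or was closed before it could be fully read.
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No URL given: demonstrate with an in-memory stream that is
            // shorter than its (made-up) declared length.
            System.out.println(compare(12, new ByteArrayInputStream(new byte[10])));
            return;
        }
        HttpURLConnection conn = (HttpURLConnection) new URL(args[0]).openConnection();
        try {
            // getContentLengthLong() returns -1 when the header is absent.
            long declared = conn.getContentLengthLong();
            try (InputStream in = conn.getInputStream()) {
                System.out.println(compare(declared, in));
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Running it with the document's URL as the only argument reports whether the two sizes
agree.  A MISMATCH or FAILED result would point at the web server or the declared length
rather than at the output connector.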

> Documents unreachable due to hopcount are not considered unreachable on cleanup pass
> ------------------------------------------------------------------------------------
>                 Key: CONNECTORS-1562
>                 URL:
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elastic
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>         Attachments: Screenshot from 2018-12-31 11-17-29.png, manifoldcf.log.cleanup,
manifoldcf.log.init, manifoldcf.log.reduced
>   Original Estimate: 4h
>  Remaining Estimate: 4h
> My documents aren't removed from the ElasticSearch index after rerunning the job with changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to keep it running
even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.

This message was sent by Atlassian JIRA