manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF documentum indexing issue
Date Wed, 14 Jun 2017 13:04:31 GMT
Hi,

The exception in the solr.log should be reported as a Solr bug.  It is not
emanating from the Tika extractor (Solr Cell), but is in Solr itself.

I wish there was an easy fix for this.  The problem is *not* an empty
stream; it's that Solr is attempting to do something with it that it
shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
from that.
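
As a rough illustration of why a persistent 500 ends up halting the job, here is a minimal Python sketch of a retry-then-abort loop. It is not ManifoldCF's actual code: the names `ServiceInterruption`, `push_to_solr`, `index_with_retries`, and the `max_attempts` limit are all hypothetical, and the Solr call is replaced by a caller-supplied function that returns an HTTP status.

```python
# Hypothetical sketch only: these names and the retry limit are assumptions,
# not ManifoldCF internals. The Solr request is simulated by `status_for`.

class ServiceInterruption(Exception):
    """Raised when the output server returns an error the crawler cannot classify."""


def push_to_solr(doc_id, status_for):
    """Simulate sending one document; `status_for` returns the HTTP status."""
    status = status_for(doc_id)
    if status >= 500:
        # The crawler only sees the status code, not the cause inside Solr.
        raise ServiceInterruption(f"{doc_id} ({status})")
    return status


def index_with_retries(doc_id, status_for, max_attempts=3):
    """Retry the same document; abort the job once attempts are exhausted."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return push_to_solr(doc_id, status_for)
        except ServiceInterruption as exc:
            last_error = exc  # same document, same error on every attempt
    raise RuntimeError(
        "Repeated service interruptions - failure processing document: "
        f"{last_error}"
    )
```

In this sketch a document that always yields a 500 raises `RuntimeError` after `max_attempts` tries, which mirrors the job-level abort reported in this thread.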

>>>>>>
https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5
(500)
<<<<<<

Karl




On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
tthamizharasan@worldbankgroup.org> wrote:

> Hi Karl,
>
>
>
> After configuring Solr to ignore Tika errors by adding a Tika transformer to
> the job, we observe the behavior below.
>
>
>
> 1)      ManifoldCF fetches content from Documentum that is null and tries
> to push it to the output connector (Solr).
>
> 2)      Solr cannot accept null as a value and throws a “Missing content
> stream” error.
>
> 3)      Each agent thread in ManifoldCF gets held up on a different
> r_object_id that has no body content and keeps trying to push the content
> to Solr after each failure, but Solr cannot accept the content and throws
> the same error.
>
> 4)      Over time, the ManifoldCF job stops with the error thrown by
> Solr.
>
>
>
> Please let us know if there is any configuration change that can help us
> resolve this issue.
>
>
>
> Please find attached the ManifoldCF error log, Solr error log, and agent log.
>
>
>
> Regards,
>
> Tamizh Kumaran.
>
>
>
> *From:* Karl Wright [mailto:daddywri@gmail.com]
> *Sent:* Tuesday, June 13, 2017 2:23 PM
> *To:* user@manifoldcf.apache.org
> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
> *Subject:* Re: ManifoldCF documentum indexing issue
>
>
>
> Hi Tamizh,
>
>
>
> The reported error is 'Error from server at
> http://localhost:8983/solr/documentum_manifoldcf_stg: String index out of
> range: -188'.  The message seemingly indicates that the error was *received*
> from the solr server for one specific document.  ManifoldCF does not
> recognize the error as being innocuous and therefore it will retry for a
> while until it eventually gives up and halts the job.  However, I cannot find
> that exact text anywhere in the Solr output connector code, so I wonder if
> you transcribed it correctly?
>
> There should also be the following:
>
> (1) A record of the attempts in the manifoldcf.log file, with a MCF stack
> trace attached to each one;
>
> (2) Simple history records for that document that are of the type
> INGESTDOCUMENT.
>
> (3) Solr log entries that have a Solr stack trace.
>
>
>
> The last one is the one that would be the most helpful.  It is possible
> that you are seeing a problem in Solr Cell (Tika) that is manifesting
> itself in this way.  You can (and should) configure your Solr to ignore
> Tika errors.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
>
>
>
>
> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
> Hi,
>
>
>
> ManifoldCF 2.7.1 is running in the multiprocess ZK model and is
> integrated with PostgreSQL 9.3. The intended setup crawls the Documentum
> contents and pushes them to the Solr 5.3.2 output. The crawler-ui app is
> installed on Tomcat, and the startup script points to the MF
> properties.xml during server startup. ManifoldCF, the bundled ZK, and
> Tomcat run on the same host, whose OS is Red Hat Enterprise Linux
> Server release 6.9 (Santiago). The DB runs on a Windows box.
>
> The ZK is integrated with the DB through properties.xml and
> properties-global.xml.
>
> The ZK and the Documentum-related processes (registry and server) are up,
> and the two agents (start-agents.sh and start-agents-2.sh) are started.
> These spawn multiple threads to index the Documentum contents into Solr
> through ManifoldCF.
>
>
>
> The connection limits currently configured in MF are as follows.
>
> Solr output max connections: 25
>
> Document repository max connections: 25
>
> Properties.xml:
>
> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>
> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>
> Total documentum document count : 0.5 million
>
>
>
> After the job is started, it indexes some 20,000+ documents and then
> terminates with the error below on the ManifoldCF job.
>
> Error: Repeated service interruptions - failure processing document: Error
> from server at http://localhost:8983/solr/documentum_manifoldcf_stg:
> String index out of range: -188
>
>
>
> Please find the attached manifoldCF error log and agent log.
>
>
>
> Please let me know your observations on the cause of the issue and on the
> thread configuration used for crawling. Please share your thoughts.
>
>
>
> Regards,
>
> Tamizh Kumaran
>
>
>
>
>
