manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF documentum indexing issue
Date Wed, 14 Jun 2017 13:49:42 GMT
Here's the response:

>>>>>>
Karl -

There’s expandMacros=false, as covered here:
https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution

But… what exactly is being sent to Solr? Is there some kind of “${…”
being sent as a parameter? Just curious what’s getting you into this in
the first place. But disabling probably is your most desired solution.

        Erik
<<<<<<

Karl
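For reference, Erik's expandMacros=false suggestion can be applied either per request (as a request parameter) or as a request-handler default in solrconfig.xml. A minimal sketch of the config approach; the handler name and surrounding configuration are assumptions for illustration, not taken from this thread:

```xml
<!-- Hypothetical solrconfig.xml fragment: disable macro expansion
     (${...} parameter substitution) for requests to this handler. -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="expandMacros">false</str>
  </lst>
</requestHandler>
```

Setting it as a handler default avoids having to change every client, which matters when the client (here, the MCF Solr output connector) cannot easily be modified.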


On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <daddywri@gmail.com> wrote:

> Here's the question I posted:
>
> >>>>>>
> Hi all,
>
> I've got a ManifoldCF user who is posting content to Solr using the MCF
> Solr output connector.  This connector uses SolrJ under the covers -- a
> fairly recent version -- but also has overridden some classes to ensure
> that multipart form posts will be used for most content.
>
> The problem is that, for a specific document, the user is getting a
> StringIndexOutOfBoundsException in Solr, as follows:
>
> >>>>>>
> 2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] -
> {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1, node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} -
> java.lang.StringIndexOutOfBoundsException: String index out of range: -296
>         at java.lang.String.substring(String.java:1911)
>         at org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
>         at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
>         at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
>         at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>         at org.eclipse.jetty.server.Server.handle(Server.java:499)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>         at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>         at java.lang.Thread.run(Thread.java:745)
> <<<<<<
>
> It looks worrisome to me that there's now possibly some kind of "macro
> expansion" that is being triggered within parameters being sent to Solr.
> Can anyone tell me either how to (a) disable this feature, or (b) how the
> MCF Solr output connector should escape parameters being posted so that
> Solr does not attempt any macro expansion?  If the latter, I also need to
> know when this feature appeared, since obviously whether or not to do the
> escaping will depend on the precise version of the Solr instance involved.
>
> I'm also quite concerned that considerations of backwards compatibility
> may have been lost at some point with Solr, since heretofore I could count
> on older versions of SolrJ working with newer versions of Solr.  Please
> clarify what the current policy is....
>
>
> Thanks,
> Karl
> <<<<<<
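As a client-side stopgap while waiting for an answer on escaping, one can at least detect which parameter values contain the `${` pattern that triggers Solr's MacroExpander. A minimal sketch; the helper function is hypothetical, not part of ManifoldCF or SolrJ:

```python
def macro_triggers(params):
    """Return the subset of request parameters whose values contain the
    ``${`` pattern that Solr's MacroExpander would attempt to expand.

    ``params`` is a dict of parameter name -> string value, as they would
    be sent to Solr. Hypothetical helper for diagnosis, not an MCF API.
    """
    return {name: value for name, value in params.items()
            if "${" in value}

# Example: a literal "${" inside a field value is what can trip the
# expander (and, per the trace above, crash it with a negative index).
suspect = macro_triggers({
    "literal.title": "Budget ${FY2017} report",   # contains "${"
    "literal.author": "Tamizh Kumaran",           # harmless
})
```

Logging the `suspect` map before each post would identify which document field is smuggling macro syntax into the request.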
>
>
>
> On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> I posted the pertinent question to the Solr dev list.  Let's see what
>> they say.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> The exception in the solr.log should be reported as a Solr bug.  It is
>>> not emanating from the Tika extractor (Solr Cell), but is in Solr itself.
>>>
>>> I wish there was an easy fix for this.  The problem is *not* an empty
>>> stream; it's that Solr is attempting to do something with it that it
>>> shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
>>> from that.
>>>
>>> >>>>>>
>>> https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5
>>> (500)
>>> <<<<<<
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
>>> tthamizharasan@worldbankgroup.org> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>>
>>>>
>>>> After configuring Solr to ignore Tika errors and adding the Tika
>>>> transformer to the job, the below behavior is observed.
>>>>
>>>>
>>>>
>>>> 1)      ManifoldCF fetches content from Documentum that contains
>>>> null content and tries to push it to the output connector (Solr).
>>>>
>>>> 2)      Solr cannot accept null as a value and throws a "Missing
>>>> content stream" error.
>>>>
>>>> 3)      Each agent thread in ManifoldCF is internally held up with
>>>> different r_object_id's that have no body content, and keeps trying to
>>>> push the content to Solr after each failure, but Solr cannot accept the
>>>> content and throws the same error.
>>>>
>>>> 4)      Over time, the ManifoldCF job stops with the error thrown by
>>>> Solr.
>>>>
>>>>
>>>>
>>>> Please let me know if there is any configuration change that can help
>>>> us resolve this issue.
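One possible workaround for the behavior described in step 1 above is to filter out null-content documents before they ever reach the output connector. A minimal sketch of the idea; the function and the dict shape are illustrative, not actual ManifoldCF APIs:

```python
def should_index(doc):
    """Return True only for documents that have a non-empty body.

    Documents with null or empty content would otherwise trigger Solr's
    "Missing content stream" error on every retry. ``doc`` is a plain
    dict standing in for a fetched Documentum object; this is an
    illustration of the filtering idea, not a ManifoldCF API.
    """
    body = doc.get("content")
    return body is not None and len(body) > 0

docs = [
    {"r_object_id": "091e8486805142f5", "content": None},
    {"r_object_id": "091e8486805142f6", "content": b"some body text"},
]
indexable = [d for d in docs if should_index(d)]
```

In MCF terms this corresponds to excluding zero-length documents in the job's document filtering, so the retry loop in step 3 never starts.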
>>>>
>>>>
>>>>
>>>> Please find the attached ManifoldCF error log, Solr error log, and
>>>> agent log.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Tamizh Kumaran.
>>>>
>>>>
>>>>
>>>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>>>> *Sent:* Tuesday, June 13, 2017 2:23 PM
>>>> *To:* user@manifoldcf.apache.org
>>>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>>>> *Subject:* Re: ManifoldCF documentum indexing issue
>>>>
>>>>
>>>>
>>>> Hi Tamizh,
>>>>
>>>>
>>>>
>>>> The reported error is 'Error from server at
>>>> http://localhost:8983/solr/documentum_manifoldcf_stg: String index out
>>>> of range: -188'.  The message seemingly indicates that the error was
>>>> *received* from the Solr server for one specific document.  ManifoldCF
>>>> does not recognize the error as being innocuous and therefore it will
>>>> retry for a while until it eventually gives up and halts the job.
>>>> However, I cannot find that exact text anywhere in the Solr output
>>>> connector code, so I wonder if you transcribed it correctly?
>>>>
>>>> There should also be the following:
>>>>
>>>> (1) A record of the attempts in the manifoldcf.log file, with a MCF
>>>> stack trace attached to each one;
>>>>
>>>> (2) Simple history records for that document that are of the type
>>>> INGESTDOCUMENT.
>>>>
>>>> (3) Solr log entries that have a Solr stack trace.
>>>>
>>>>
>>>>
>>>> The last one is the one that would be the most helpful.  It is possible
>>>> that you are seeing a problem in Solr Cell (Tika) that is manifesting
>>>> itself in this way.  You can (and should) configure your Solr to ignore
>>>> Tika errors.
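The suggestion to make Solr ignore Tika errors typically maps to the extracting request handler's ignoreTikaException flag. A minimal sketch of the solrconfig.xml fragment; the handler name and class are assumptions to be checked against your Solr 5.x configuration:

```xml
<!-- Hypothetical solrconfig.xml fragment: let Solr Cell swallow Tika
     extraction errors instead of failing the whole update request. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
```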
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
>>>> tthamizharasan@worldbankgroup.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> ManifoldCF 2.7.1 is running in the multiprocess ZooKeeper model and is
>>>> integrated with PostgreSQL 9.3. The expected setup is to crawl the
>>>> Documentum content and push it to the output Solr 5.3.2. The crawler-ui
>>>> app is installed on Tomcat, and the startup script points to the
>>>> ManifoldCF properties.xml during server startup. ManifoldCF, the
>>>> bundled ZooKeeper, and Tomcat run on the same host under Red Hat
>>>> Enterprise Linux Server release 6.9 (Santiago). The database runs on a
>>>> Windows box.
>>>>
>>>> ZooKeeper is integrated with the database through properties.xml and
>>>> properties-global.xml.
>>>>
>>>> ZooKeeper and the Documentum-related processes (registry and server)
>>>> are up, and the two agents (start-agents.sh and start-agents-2.sh) are
>>>> started, which spawn multiple threads to index the Documentum content
>>>> into Solr through ManifoldCF.
>>>>
>>>>
>>>>
>>>> The current connection counts configured in ManifoldCF are as below.
>>>>
>>>> Solr output max connections: 25
>>>>
>>>> Document repository max connections: 25
>>>>
>>>> Properties.xml:
>>>>
>>>> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>>>>
>>>> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>>>>
>>>> Total Documentum document count: 0.5 million
>>>>
>>>>
>>>>
>>>> After the job is started, it indexes some 20,000+ documents and then
>>>> terminates with the below error on the ManifoldCF job:
>>>>
>>>> Error: Repeated service interruptions - failure processing document:
>>>> Error from server at
>>>> http://localhost:8983/solr/documentum_manifoldcf_stg: String index out
>>>> of range: -188
>>>>
>>>>
>>>>
>>>> Please find the attached ManifoldCF error log and agent log.
>>>>
>>>>
>>>>
>>>> Please let me know your observations on the cause of the issue and on
>>>> the thread configuration used for crawling. Please share your
>>>> thoughts.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Tamizh Kumaran
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
