manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: ManifoldCF documentum indexing issue
Date Wed, 21 Jun 2017 10:04:54 GMT
I've created a ticket, CONNECTORS-1434, to look at the file name issues.

Karl


On Wed, Jun 21, 2017 at 5:44 AM, Karl Wright <daddywri@gmail.com> wrote:

> There is no good way to handle a case where Solr doesn't like the file
> name.  About the only thing that could be done would be to encode the
> filename using something like URL encoding.  This might have some effects
> on existing users, but more importantly, we really would need to know what
> characters were legal before adopting that solution.
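A minimal sketch of that encoding idea (illustration only, not the connector's actual code; the helper name is hypothetical, and java.net.URLEncoder is just one possible encoding scheme):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FileNameEncoding {
    // Hypothetical helper: make a problematic file name transmission-safe
    // by URL-encoding it before it is sent to Solr.
    static String encodeFileName(String rawName) {
        try {
            return URLEncoder.encode(rawName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // A file name with whitespace and double quotes, like those reported
        // in this thread: spaces become '+', quotes become %22.
        System.out.println(encodeFileName("quarterly \"final\" report.pdf"));
        // -> quarterly+%22final%22+report.pdf
    }
}
```

As noted above, the downside is that existing users would see encoded names unless Solr decoded them again on receipt.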
>
> I am not entirely sure how the file name is transmitted to Solr when using
> multipart forms, but knowing how that is done is critical to deciding what to do.
>
> Karl
>
>
> On Wed, Jun 21, 2017 at 4:55 AM, Tamizh Kumaran Thamizharasan <
> tthamizharasan@worldbankgroup.org> wrote:
>
>> Hi Karl,
>>
>>
>>
>> Thanks for the update!!!
>>
>>
>>
>> As per the response from the Solr team, expandMacros=false has been added
>> to the output connector as an additional parameter.
>>
>> After adding expandMacros=false, the indexing job completes, but a
>> “Missing content stream” error is reported for a few of the documents, and
>> those are not indexed into Solr.
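For reference, a stdlib-only sketch of how such an extra parameter can be attached to a Solr update URL (the helper is hypothetical; in practice the parameter was presumably added through the output connection's configuration):

```java
public class ExpandMacrosParam {
    // Hypothetical helper: append expandMacros=false to a Solr request URL.
    // expandMacros=false disables Solr's server-side ${...} parameter
    // substitution for that request.
    static String disableMacroExpansion(String updateUrl) {
        String sep = updateUrl.contains("?") ? "&" : "?";
        return updateUrl + sep + "expandMacros=false";
    }

    public static void main(String[] args) {
        System.out.println(disableMacroExpansion(
                "http://localhost:8983/solr/documentum_manifoldcf_stg/update/extract"));
        // -> http://localhost:8983/solr/documentum_manifoldcf_stg/update/extract?expandMacros=false
    }
}
```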
>>
>>
>>
>> As per our analysis, the file names of the PDF documents we are trying to
>> index from Documentum contain whitespace and special characters such as
>> double quotes.
>>
>> This makes the files unreadable, and the “Missing content stream” error is
>> thrown.
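To illustrate why an unescaped double quote in a file name can plausibly break a multipart upload (this is a simplified sketch, not ManifoldCF's or Solr's actual code):

```java
public class MultipartHeader {
    // Build a multipart Content-Disposition line the naive way: the raw
    // file name is dropped between quotes with no escaping.
    static String naiveDisposition(String fileName) {
        return "Content-Disposition: form-data; name=\"upfile\"; filename=\""
                + fileName + "\"";
    }

    // A naive receiver stops at the first unescaped quote, losing the rest
    // of the name -- and potentially the association with the content stream.
    static String parsedFileName(String header) {
        int start = header.indexOf("filename=\"") + "filename=\"".length();
        int end = header.indexOf('"', start);
        return header.substring(start, end);
    }

    public static void main(String[] args) {
        String header = naiveDisposition("my \"draft\".pdf");
        // The parsed name is truncated to "my " at the embedded quote.
        System.out.println("[" + parsedFileName(header) + "]");
    }
}
```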
>>
>>
>>
>> If there is any workaround to overcome this issue, kindly share it with
>> us.
>>
>>
>>
>> Regards,
>>
>> Tamizh Kumaran Thamizharasan
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* Wednesday, June 14, 2017 7:20 PM
>>
>> *To:* user@manifoldcf.apache.org
>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>> *Subject:* Re: ManifoldCF documentum indexing issue
>>
>>
>>
>> Here's the response:
>>
>>
>>
>> >>>>>>
>>
>> Karl -
>>
>> There’s expandMacros=false, as covered here:
>> https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution
>>
>> But… what exactly is being sent to Solr?    Is there some kind of “${…”
>> being sent as a parameter?   Just curious what’s getting you into this in
>> the first place.   But disabling probably is your most desired solution.
>>
>>         Erik
>>
>> <<<<<<
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>> On Wed, Jun 14, 2017 at 9:36 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Here's the question I posted:
>>
>>
>>
>> >>>>>>
>>
>> Hi all,
>>
>>
>>
>> I've got a ManifoldCF user who is posting content to Solr using the MCF
>> Solr output connector.  This connector uses SolrJ under the covers -- a
>> fairly recent version -- but also has overridden some classes to ensure
>> that multipart form posts will be used for most content.
>>
>>
>>
>> The problem is that, for a specific document, the user is getting an
>> ArrayIndexOutOfBounds exception in Solr, as follows:
>>
>>
>>
>> >>>>>>
>>
>> 2017-06-14T08:25:16,546 - ERROR [qtp862890654-69725:SolrException@148] - {collection=c:documentum_manifoldcf_stg, core=x:documentum_manifoldcf_stg_shard1_replica1, node_name=n:**********:8983_solr, replica=r:core_node1, shard=s:shard1} - java.lang.StringIndexOutOfBoundsException: String index out of range: -296
>>         at java.lang.String.substring(String.java:1911)
>>         at org.apache.solr.request.macro.MacroExpander._expand(MacroExpander.java:143)
>>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:93)
>>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:59)
>>         at org.apache.solr.request.macro.MacroExpander.expand(MacroExpander.java:45)
>>         at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:157)
>>         at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:172)
>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:152)
>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
>>         at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>>         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
>>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>>         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>>         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>>         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>>         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>>         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>>         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>>         at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>>         at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>>         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>>         at org.eclipse.jetty.server.Server.handle(Server.java:499)
>>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>>         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>>         at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>>         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>>         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> <<<<<<
>>
>>
>>
>> It looks worrisome to me that there's now possibly some kind of "macro
>> expansion" being triggered within parameters sent to Solr.
>> Can anyone tell me either (a) how to disable this feature, or (b) how the
>> MCF Solr output connector should escape parameters being posted so that
>> Solr does not attempt any macro expansion?  If the latter, I also need to
>> know when this feature appeared, since whether or not to do the escaping
>> will obviously depend on the precise version of the Solr instance involved.
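Since Solr's macro substitution is triggered by "${...}" sequences in request parameters, a connector could at least detect parameter values at risk of being macro-expanded. A stdlib-only sketch (the guard method is hypothetical, not part of MCF):

```java
public class MacroCheck {
    // Hypothetical guard: flag parameter values containing the "${" marker
    // that Solr's MacroExpander treats as the start of a macro.
    static boolean mayTriggerMacroExpansion(String paramValue) {
        return paramValue != null && paramValue.contains("${");
    }

    public static void main(String[] args) {
        // A file name like this, passed through as a parameter, is risky:
        System.out.println(mayTriggerMacroExpansion("report ${draft}.pdf")); // true
        System.out.println(mayTriggerMacroExpansion("report-final.pdf"));    // false
    }
}
```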
>>
>>
>>
>> I'm also quite concerned that considerations of backwards compatibility
>> may have been lost at some point with Solr, since heretofore I could count
>> on older versions of SolrJ working with newer versions of Solr.  Please
>> clarify what the current policy is....
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>> <<<<<<
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Jun 14, 2017 at 9:35 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> I posted the pertinent question to the solr dev list.  Let's see what
>> they say.
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>>
>>
>> On Wed, Jun 14, 2017 at 9:04 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Hi,
>>
>>
>>
>> The exception in the solr.log should be reported as a Solr bug.  It is
>> not emanating from the Tika extractor (Solr Cell), but is in Solr itself.
>>
>>
>>
>> I wish there was an easy fix for this.  The problem is *not* an empty
>> stream; it's that Solr is attempting to do something with it that it
>> shouldn't.  MCF just gets back a 500 error from Solr, and we can't recover
>> from that.
>>
>> >>>>>>
>>
>> https://**********/webtop/component/drl?versionLabel=CURRENT&objectId=091e8486805142f5 (500)
>>
>> <<<<<<
>>
>>
>>
>> Karl
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Jun 14, 2017 at 8:29 AM, Tamizh Kumaran Thamizharasan <
>> tthamizharasan@worldbankgroup.org> wrote:
>>
>> Hi Karl,
>>
>>
>>
>> After configuring Solr to ignore Tika errors by adding Tika transformer
>> in the job, below behavior is observed.
>>
>>
>>
>> 1) ManifoldCF fetches the content from Documentum, which contains null
>> content, and tries to push it to the output connector (Solr).
>>
>> 2) Solr cannot accept null as a value and throws the “Missing content
>> stream” error.
>>
>> 3) Each agent thread in ManifoldCF is internally held up with different
>> r_object_ids that have no body content and keeps trying to push the
>> content to Solr after each failure, but Solr cannot accept the content and
>> throws the same error.
>>
>> 4) Over time, the ManifoldCF job stops with the error thrown by Solr.
>>
>>
>>
>> Please let us know if there is any configuration change that can help us
>> resolve this issue.
>>
>>
>>
>> Please find attached the ManifoldCF error log, Solr error log, and agent
>> log.
>>
>>
>>
>> Regards,
>>
>> Tamizh Kumaran.
>>
>>
>>
>> *From:* Karl Wright [mailto:daddywri@gmail.com]
>> *Sent:* Tuesday, June 13, 2017 2:23 PM
>> *To:* user@manifoldcf.apache.org
>> *Cc:* Sharnel Merdeck Pereira; Sundarapandian Arumaidurai Vethasigamani
>> *Subject:* Re: ManifoldCF documentum indexing issue
>>
>>
>>
>> Hi Tamizh,
>>
>>
>>
>> The reported error is 'Error from server at
>> http://localhost:8983/solr/documentum_manifoldcf_stg:
>> String index out of range: -188'.  The
>> message seemingly indicates that the error was *received* from the solr
>> server for one specific document.  ManifoldCF does not recognize the error
>> as being innocuous and therefore it will retry for a while until it
>> eventually gives up and halts the job.  However, I cannot find that exact
>> text anywhere in the Solr output connector code, so I wonder if you
>> transcribed it correctly?
>>
>> There should also be the following:
>>
>> (1) A record of the attempts in the manifoldcf.log file, with an MCF
>> stack trace attached to each one;
>>
>> (2) Simple history records for that document that are of the type
>> INGESTDOCUMENT.
>>
>> (3) Solr log entries that have a Solr stack trace.
>>
>>
>>
>> The last one is the one that would be the most helpful.  It is possible
>> that you are seeing a problem in Solr Cell (Tika) that is manifesting
>> itself in this way.  You can (and should) configure your Solr to ignore
>> Tika errors.
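Ignoring Tika exceptions in Solr Cell is typically done with the ignoreTikaException flag. A sketch of what that might look like in solrconfig.xml (handler name and layout assumed, not taken from this thread):

```xml
<!-- solrconfig.xml: ask Solr Cell to swallow Tika parse failures
     instead of failing the whole update request -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="ignoreTikaException">true</str>
  </lst>
</requestHandler>
```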
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jun 13, 2017 at 3:20 AM, Tamizh Kumaran Thamizharasan <
>> tthamizharasan@worldbankgroup.org> wrote:
>>
>> Hi,
>>
>>
>>
>> ManifoldCF 2.7.1 is running in the multiprocess ZooKeeper model,
>> integrated with PostgreSQL 9.3. The setup is expected to crawl Documentum
>> content and push it to the output Solr 5.3.2. The crawler-ui app is
>> installed on Tomcat, and the startup script points to the ManifoldCF
>> properties.xml during server startup. ManifoldCF, the bundled ZooKeeper,
>> and Tomcat run on the same host under Red Hat Enterprise Linux Server
>> release 6.9 (Santiago). The database runs on a Windows box.
>>
>> ZooKeeper is integrated with the database through properties.xml and
>> properties-global.xml.
>>
>> ZooKeeper and the Documentum-related processes (registry and server) are
>> up, and the two agents (start-agents.sh and start-agents-2.sh) are
>> started, which spawn multiple threads to index the Documentum content into
>> Solr through ManifoldCF.
>>
>>
>>
>> The connection settings currently configured in ManifoldCF are as below.
>>
>> Solr output max connections: 25
>>
>> Documentum repository max connections: 25
>>
>> properties.xml:
>>
>> <property name="org.apache.manifoldcf.database.maxhandles" value="50"/>
>>
>> <property name="org.apache.manifoldcf.crawler.threads" value="25"/>
>>
>> Total Documentum document count: 0.5 million
>>
>>
>>
>> After the job is started, it indexes some 20,000+ documents and then
>> terminates with the below error on the ManifoldCF job:
>>
>> Error: Repeated service interruptions - failure processing document:
>> Error from server at http://localhost:8983/solr/documentum_manifoldcf_stg:
>> String index out of range: -188
>>
>>
>>
>> Please find the attached manifoldCF error log and agent log.
>>
>>
>>
>> Please let me know your observations on the cause of the issue and on the
>> thread configuration used for crawling. Please share your thoughts.
>>
>>
>>
>> Regards,
>>
>> Tamizh Kumaran
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
