manifoldcf-user mailing list archives

From: Karl Wright <daddy...@gmail.com>
Subject: Re: Exception in the running Custom Job
Date: Mon, 20 Aug 2018 14:00:07 GMT

Obviously your Allowed Documents filter is somehow causing all documents to
be excluded.  Since you have a custom repository connector I would bet
there is a coding error in it that is responsible.
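
One way to narrow this down is to log why each document is rejected before it is handed to the pipeline. The sketch below is only an illustration in plain Java: `IndexableChecks`, `filter`, and `rejectionReason` are hypothetical names, not ManifoldCF's API, and the filter rules are assumed. It shows how a connector that never sets the MIME type, or that reports a negative length, would see every document excluded.

```java
import java.util.Optional;
import java.util.Set;

public class ExclusionDebug {
    /** Hypothetical stand-in for the indexability checks a pipeline exposes to a connector. */
    interface IndexableChecks {
        boolean mimeTypeAllowed(String mimeType);
        boolean lengthAllowed(long length);
    }

    /** Assumed filter rules: a set of allowed MIME types and a maximum length in bytes. */
    static IndexableChecks filter(Set<String> mimeTypes, long maxLength) {
        return new IndexableChecks() {
            public boolean mimeTypeAllowed(String mimeType) {
                return mimeType != null && mimeTypes.contains(mimeType.toLowerCase());
            }
            public boolean lengthAllowed(long length) {
                return length >= 0 && length <= maxLength;
            }
        };
    }

    /** Returns the reason a document would be excluded, or empty if it would pass. */
    static Optional<String> rejectionReason(IndexableChecks checks, String mimeType, long length) {
        if (!checks.mimeTypeAllowed(mimeType)) {
            return Optional.of("mime type rejected: " + mimeType);
        }
        if (!checks.lengthAllowed(length)) {
            return Optional.of("length rejected: " + length);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        IndexableChecks checks = filter(Set.of("application/pdf"), 10_000_000L);
        // Correct metadata passes the filter.
        System.out.println(rejectionReason(checks, "application/pdf", 2_048L));
        // A connector that never sets the MIME type gets this document excluded.
        System.out.println(rejectionReason(checks, null, 2_048L));
    }
}
```

If every document comes back with the same reason, the value the connector passes for that field is the first thing to check.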

Karl


On Mon, Aug 20, 2018 at 8:49 AM Nikita Ahuja <nikita@smartshore.nl> wrote:

> Hi Karl,
>
> Thanks for the reply.
>
> I am using them in that same sequence: the Allowed Documents transformer
> is added first, and then the Tika transformation.
>
> But nothing runs in that scenario. The job simply ends without returning
> anything in the output.
>
> On Mon, Aug 20, 2018 at 5:36 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi,
>>
>> You are running out of memory.
>> Tika's memory consumption is not well defined so you will need to limit
>> the size of documents that reach it.  This is not the same as limiting the
>> size of documents *after* Tika extracts them.
>>
>> The Allowed Documents transformer therefore should be placed in the
>> pipeline before the Tika Extractor.
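
The ordering point above can be illustrated with a toy pipeline. This is a minimal sketch in plain Java, not ManifoldCF code: `Doc`, `Stage`, `SizeFilter`, `FakeExtractor`, and the "extracted text is a tenth of the raw size" rule are all assumptions made for the example. With the size filter first, the extractor never sees the oversized document; with the extractor first, it parses everything, and the filter only ever sees the much smaller extracted size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class PipelineOrder {
    /** Hypothetical document: a name and a size in bytes. */
    static final class Doc {
        final String name;
        final long sizeBytes;
        Doc(String name, long sizeBytes) { this.name = name; this.sizeBytes = sizeBytes; }
    }

    /** A pipeline stage either passes a (possibly changed) document on or drops it. */
    interface Stage {
        Optional<Doc> process(Doc doc);
    }

    /** Drops documents larger than maxBytes. */
    static final class SizeFilter implements Stage {
        private final long maxBytes;
        SizeFilter(long maxBytes) { this.maxBytes = maxBytes; }
        public Optional<Doc> process(Doc doc) {
            return doc.sizeBytes <= maxBytes ? Optional.of(doc) : Optional.empty();
        }
    }

    /** Stand-in for an expensive extractor; records every document it had to parse. */
    static final class FakeExtractor implements Stage {
        final List<String> seen = new ArrayList<>();
        public Optional<Doc> process(Doc doc) {
            seen.add(doc.name);
            // Assumption for the sketch: extracted text is a tenth of the raw size.
            return Optional.of(new Doc(doc.name, doc.sizeBytes / 10));
        }
    }

    /** Runs each document through the stages in list order; returns the names that survive. */
    static List<String> run(List<Stage> stages, List<Doc> docs) {
        List<String> indexed = new ArrayList<>();
        for (Doc doc : docs) {
            Optional<Doc> current = Optional.of(doc);
            for (Stage stage : stages) {
                if (!current.isPresent()) {
                    break; // dropped by an earlier stage
                }
                current = stage.process(current.get());
            }
            current.ifPresent(d -> indexed.add(d.name));
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(new Doc("small.pdf", 1_000_000L), new Doc("huge.pdf", 500_000_000L));
        FakeExtractor extractor = new FakeExtractor();
        run(List.of(new SizeFilter(50_000_000L), extractor), docs);
        System.out.println("extractor saw: " + extractor.seen);
    }
}
```

Swapping the two stages in the list passed to `run` is enough to see the difference: the extractor's `seen` list then contains both documents.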
>>
>> "Also it is not compatible with the Allowed Documents and Metadata
>> Adjuster Connectors."
>>
>> This is a huge red flag.  Why not?
>>
>> Karl
>>
>>
>> On Mon, Aug 20, 2018 at 6:47 AM Nikita Ahuja <nikita@smartshore.nl>
>> wrote:
>>
>>> Hi Karl,
>>>
>>> A custom job for Aconex is running in our ManifoldCF environment, but it
>>> is not able to crawl the complete set of documents. It crashes in the
>>> middle of execution.
>>>
>>> Also it is not compatible with the Allowed Documents and Metadata
>>> Adjuster Connectors.
>>>
>>> The custom job's connector is modeled on the existing Jira connector in
>>> ManifoldCF.
>>>
>>> It is showing the errors below. Please suggest the appropriate steps to
>>> follow to make it run smoothly.
>>>
>>>
>>>
>>> Connect to uk1.aconex.co.uk:443 [uk1.aconex.co.uk/---.---.---.---] failed: Read timed out
>>> agents process ran out of memory - shutting down
>>> agents process ran out of memory - shutting down
>>> agents process ran out of memory - shutting down
>>> agents process ran out of memory - shutting down
>>> java.lang.OutOfMemoryError: Java heap space
>>> java.lang.OutOfMemoryError: Java heap space
>>> java.lang.OutOfMemoryError: Java heap space
>>>         at org.apache.manifoldcf.core.database.Database.beginTransaction(Database.java:240)
>>>         at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1361)
>>>         at org.apache.manifoldcf.core.database.DBInterfaceHSQLDB.beginTransaction(DBInterfaceHSQLDB.java:1327)
>>>         at org.apache.manifoldcf.crawler.jobs.JobManager.assessMarkedJobs(JobManager.java:823)
>>>         at org.apache.manifoldcf.crawler.system.AssessmentThread.run(AssessmentThread.java:65)
>>> java.lang.OutOfMemoryError: Java heap space
>>>         at org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState.clone(PDGraphicsState.java:494)
>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.saveGraphicsState(PDFStreamEngine.java:898)
>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:721)
>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextString(PDFStreamEngine.java:587)
>>>         at org.apache.pdfbox.contentstream.operator.text.ShowText.process(ShowText.java:55)
>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:838)
>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
>>>         at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>>>         at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>>>         at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>>>         at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
>>>         at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>>>         at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>>>         at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
>>>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:168)
>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>>>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>>>         at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
>>>         at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineObjectWithVersions.addOrReplaceDocumentWithException(IncrementalIngester.java:2708)
>>>         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:756)
>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1583)
>>>         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocumentWithException(WorkerThread.java:1548)
>>>         at org.apache.manifoldcf.crawler.connectors.aconex.AconexSession.fetchAndIndexFile(AconexSession.java:720)
>>>         at org.apache.manifoldcf.crawler.connectors.aconex.AconexRepositoryConnector.processDocuments(AconexRepositoryConnector.java:1194)
>>>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>>> [Thread-431] INFO org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@2c0b4c83{HTTP/1.1}{0.0.0.0:8345}
>>> [Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@4c03a37{/mcf-api-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-api-service.war-_mcf-api-service-any-3117653580650249372.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-api-service.war}
>>> [Thread-431] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@65ae095c{/mcf-authority-service,file:/C:/Users/smartshore/AppData/Local/Temp/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-service-any-8288503227579256193.dir/webapp/,UNAVAILABLE}{D:\Manifold\apache-manifoldcf-2.8.1\example\.\..\web\war\mcf-authority-service.war}
>>> Connect to uk1.aconex.co.uk:443 [uk1.aconex.co.uk/23.10.35.84] failed: Read timed out
>>> --
>>> Thanks and Regards,
>>> Nikita
>>> Email: nikita@smartshore.nl
>>> United Sources Service Pvt. Ltd.
>>> a "Smartshore" Company
>>> Mobile: +91 99 888 57720
>>> http://www.smartshore.nl
>>>
>>
>
>
> --
> Thanks and Regards,
> Nikita
> Email: nikita@smartshore.nl
> United Sources Service Pvt. Ltd.
> a "Smartshore" Company
> Mobile: +91 99 888 57720
> http://www.smartshore.nl
>
