manifoldcf-dev mailing list archives

From "Phil (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CONNECTORS-1325) Invalid XML character causing job to abort
Date Thu, 30 Jun 2016 00:44:12 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356267#comment-15356267 ]

Phil edited comment on CONNECTORS-1325 at 6/30/16 12:43 AM:
------------------------------------------------------------

Hi [~daddywri],

After installing the patch, I'm finding that it does ignore the error. However, the crawler
continues to attempt to process this document (or at least its metadata), so
the crawl never finishes. It has now been running for a few days.

I tailed the logs for a particular document using the following:
{{tail -f manifoldcf.log | grep "<DOCUMENT_URL>"}}

This showed the following lines being repeated:
{code}
DEBUG 2016-06-30 09:59:32,928 (Worker thread '13') sharepoint.SharePointRepository - SharePoint: Finding metadata to include for document/item <DOCUMENT_URL>
DEBUG 2016-06-30 09:59:32,946 (Worker thread '13') sharepoint.SPSProxyHelper - SharePoint: In getFieldValues; fieldNames= ....
DEBUG 2016-06-30 09:59:33,100 (Worker thread '27') sharepoint.SharePointRepository - SharePoint: Getting version of <DOCUMENT_URL>
DEBUG 2016-06-30 09:59:33,100 (Worker thread '27') sharepoint.SharePointRepository - SharePoint: Checking whether to include list item ....

.....
....
{code}

I've omitted some repository-specific details, but let me know if you need more information.

Any idea why this might be happening?

Thanks


was (Author: priethmuller):
Hi [~daddywri],

I'm finding after installing the patch that it does ignore the error. However, the crawler
is continuing to attempt to process this document (or at least the metadata), resulting in
the crawler never finishing. It's currently been running for a few days.

I tailed the logs for a particular document using the following:
{{tail -f manifoldcf.log | grep "<DOCUMENT_URL>"}}

Which resulted in the following lines being repeated:
{code}
DEBUG 2016-06-30 09:59:32,928 (Worker thread '13') sharepoint.SharePointRepository - SharePoint: Finding metadata to include for document/item <DOCUMENT_URL>
DEBUG 2016-06-30 09:59:32,946 (Worker thread '13') sharepoint.SPSProxyHelper - SharePoint: In getFieldValues; fieldNames= ....
DEBUG 2016-06-30 09:59:33,100 (Worker thread '27') sharepoint.SharePointRepository - SharePoint: Getting version of <DOCUMENT_URL>
DEBUG 2016-06-30 09:59:33,100 (Worker thread '27') sharepoint.SharePointRepository - SharePoint: Checking whether to include list item ....

.....
....
{code}

I've omitted some repository specific details, but let me know if you want any further details.

Any idea why this might be happening?

Thanks

> Invalid XML character causing job to abort
> ------------------------------------------
>
>                 Key: CONNECTORS-1325
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: SharePoint connector
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Phil
>            Assignee: Karl Wright
>            Priority: Blocker
>             Fix For: ManifoldCF 2.5
>
>         Attachments: CONNECTORS-1325.patch
>
>
> The following error causes the ManifoldCF job to abort, so the job can never finish.
> It would be good to have the crawler log this error rather than throw an exception that stops the entire job.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread - Exception tossed: XML parsing error: Character reference "&#xD83D" is an invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64; Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
>         ... 4 more
> {code}
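
For context on the error itself: U+D83D is a UTF-16 high surrogate, and XML 1.0 does not allow surrogate code points as character references, which is why the parser rejects "&#xD83D". The following is a minimal sketch of one possible workaround, stripping invalid numeric character references from the response text before it is parsed. It is not the content of CONNECTORS-1325.patch, and the class and method names (XmlCharRefSanitizer, stripInvalidCharRefs) are hypothetical, not part of ManifoldCF.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ManifoldCF: removes numeric character
// references that XML 1.0 forbids (for example lone surrogates such as &#xD83D;).
final class XmlCharRefSanitizer {
    // Matches decimal (&#123;) and hexadecimal (&#x1F;) character references.
    private static final Pattern CHAR_REF =
        Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    // The XML 1.0 "Char" production: tab, LF, CR, and the ranges below.
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the input with every invalid character reference deleted.
    // Valid references and named entities (such as &amp;) are left untouched.
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.charAt(0) == 'x'
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body, 10);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be any code point, treat as invalid
            }
            String replacement = isValidXml10Char(cp) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "<field name=\"Title\">Hello &#xD83D;&#xDE00; world &#65;</field>";
        System.out.println(stripInvalidCharRefs(raw));
    }
}
```

Note that this drops both halves of a surrogate pair, so an emoji encoded that way is lost rather than preserved. Combining the two references into a single supplementary code point would be an alternative, at the cost of more logic.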



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)