manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort
Date Thu, 23 Jun 2016 06:42:16 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345898#comment-15345898 ]

Karl Wright commented on CONNECTORS-1325:
-----------------------------------------

It looks like what you have here is one half of a UTF-16 surrogate pair (0xD83D is a high
surrogate), which is how UTF-16 encodes a character in a supplementary plane:

https://en.wikipedia.org/wiki/UTF-16

As such, it should be encoded in XML as a single character reference to the full code point
(up to six hex digits), not as a reference to an individual surrogate.  However, even if it
were represented correctly, that still might not help, since a single Java char cannot hold
such a character; a Java String has to store it as a surrogate pair.  It might be possible to
tell Xerces to ignore such characters if they were properly encoded, at least.
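
To illustrate what "properly encoded" would mean, here is a minimal sketch that rewrites a
pair of surrogate character references into one reference to the combined code point.  It
assumes the source XML contains the high and low surrogates as two adjacent hexadecimal
references (the log only shows the first one), and SurrogateReferenceFixer is a hypothetical
helper, not something that exists in ManifoldCF or Xerces:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of ManifoldCF or Xerces.
final class SurrogateReferenceFixer {
  // A high-surrogate reference (D800-DBFF) immediately followed by a
  // low-surrogate reference (DC00-DFFF), e.g. "&#xD83D;&#xDE00;".
  // Assumption: four hex digits, no leading zeros, no decimal references.
  private static final Pattern PAIR = Pattern.compile(
      "&#x([Dd][89ABab][0-9A-Fa-f]{2});&#x([Dd][C-Fc-f][0-9A-Fa-f]{2});");

  // Replaces each surrogate pair of references with a single reference
  // to the supplementary code point; everything else is left untouched.
  static String fix(String xml) {
    Matcher m = PAIR.matcher(xml);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      char high = (char) Integer.parseInt(m.group(1), 16);
      char low = (char) Integer.parseInt(m.group(2), 16);
      int codePoint = Character.toCodePoint(high, low);
      String reference = String.format("&#x%X;", codePoint);
      m.appendReplacement(out, Matcher.quoteReplacement(reference));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(fix("<v>&#xD83D;&#xDE00;</v>"));
  }
}
```

A lone surrogate reference with no partner is left as-is by this sketch, so the parser would
still reject it.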

Failing that, skipping the document because it contains bad characters would seem to be the
only reasonable option.  However, it would be a brute-force approach, because *any* parsing
error would have to be presumed to be a character issue.
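
A rough sketch of that brute-force option, using only the standard JAXP API: any SAXException
from the parser is logged and treated as "skip this document".  SkipOnParseError and
parseOrSkip are hypothetical names for illustration; this is not the actual XMLDoc code, and
how a skip would be reported back to the crawler is left out:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical sketch; not the ManifoldCF XMLDoc implementation.
final class SkipOnParseError {
  // Returns the parsed document, or an empty Optional if the XML could not
  // be parsed for any reason (bad character reference or otherwise).
  static Optional<Document> parseOrSkip(String xml) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // Rethrow parser errors instead of letting the default handler print them.
      builder.setErrorHandler(new ErrorHandler() {
        @Override public void warning(SAXParseException e) { }
        @Override public void error(SAXParseException e) throws SAXException { throw e; }
        @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
      });
      byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
      return Optional.of(builder.parse(new ByteArrayInputStream(bytes)));
    } catch (SAXException e) {
      // Brute force: every parsing error is presumed to be a character issue.
      System.err.println("Skipping document, XML parsing error: " + e.getMessage());
      return Optional.empty();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(parseOrSkip("<v>&#xD83D;</v>").isPresent());
  }
}
```

The downside is visible in the catch block: a genuinely malformed response would be skipped
just as silently as one with a bad character reference.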

> Invalid XML character causing job to abort
> ------------------------------------------
>
>                 Key: CONNECTORS-1325
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: SharePoint connector
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Phil
>            Assignee: Karl Wright
>            Priority: Blocker
>
> The following error is causing the ManifoldCF job to abort, so the job is never able to finish.
> It would be good to have the crawler log this error, but not throw an exception that causes the entire job to stop.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread - Exception tossed: XML parsing error: Character reference "&#xD83D" is an invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64; Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
>         ... 4 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
