manifoldcf-dev mailing list archives

From "Karl Wright (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort
Date Thu, 13 Oct 2016 11:31:20 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571653#comment-15571653 ]

Karl Wright commented on CONNECTORS-1325:
-----------------------------------------

Hi [~kavdeev]: There is no "translation" happening on the MCF side.  Axis 1.4 is used with
httpcomponents/httpclient for transport.  Axis uses the registered XML provider which in this
case is xerces.

The XML printed by the debug message is what is provided by Axis as the SOAP response. If
Axis is rewriting the SOAP, that's not something we can address.  We do not parse that SOAP
response -- Axis does.  We just report it for debugging purposes.

There are two kinds of errors here, then.  The first kind is Axis rewriting the SOAP response
in such a way that it is not parseable.  This is expected because the character lies outside
the Basic Multilingual Plane; it cannot be represented as a single Java char.  (The Unicode
code point is U+1F600.)  So even though the original XML is legal, the rewritten XML cannot be
parsed, because a lone surrogate such as &#xD83D is not a legal XML character reference.
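
To illustrate the point about Java characters: U+1F600 is above U+FFFF, so Java holds it as two
16-bit char values (a surrogate pair) rather than one. The following is a minimal sketch only;
the class and method names are invented for illustration and are not part of MCF or Axis.

```java
class SurrogatePairDemo {
    // Returns the UTF-16 code units Java uses for a code point, as hex strings.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 needs two chars: a high surrogate and a low surrogate.
        System.out.println(utf16Units(0x1F600));          // D83D DE00
        System.out.println(Character.charCount(0x1F600)); // 2
    }
}
```

The first unit, D83D, is the value named in the reported parse error.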

The second kind of problem is that including an entity reference in the XML itself (not a
field) is not allowed.  This is the case you actually care about if I understand correctly.
Unfortunately, if the XML is illegal, the XML parser will fail to parse it.  That's the end
of the story, I'm afraid.
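
As a rough illustration of that parser behavior, the sketch below uses the JDK's built-in JAXP
parser, which is an assumption: it is not MCF's XMLDoc class and not necessarily the same xerces
build. LenientParseDemo and tryParse are hypothetical names. A character reference to a lone
surrogate is rejected, while a reference to the full code point is accepted. Catching the
exception is shown only as one way a caller could log and skip; it is not what the connector does.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class LenientParseDemo {
    // Parses the XML, returning empty (and logging) if it is not well-formed.
    static Optional<Document> tryParse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return Optional.of(builder.parse(new InputSource(new StringReader(xml))));
        } catch (SAXParseException e) {
            // The parser refuses illegal XML; all the caller can do is report it.
            System.err.println("Skipping unparseable XML: " + e.getMessage());
            return Optional.empty();
        } catch (Exception e) {
            // Configuration or I/O problems are not expected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Reference to the whole code point: legal XML.
        System.out.println(tryParse("<a>&#x1F600;</a>").isPresent());
        // References to the two surrogate halves: illegal XML.
        System.out.println(tryParse("<a>&#xD83D;&#xDE00;</a>").isPresent());
    }
}
```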


> Invalid XML character causing job to abort
> ------------------------------------------
>
>                 Key: CONNECTORS-1325
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: SharePoint connector
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Phil
>            Assignee: Karl Wright
>            Priority: Blocker
>             Fix For: ManifoldCF 2.5
>
>         Attachments: CONNECTORS-1325-2.patch, CONNECTORS-1325-3.patch, CONNECTORS-1325.patch, mcf-bad-ms-char.xml
>
>
> The following error is causing the Manifold job to abort, and subsequently the job not being able to finish.
> It would be good to have the crawler log this error, but not throw an exception which causes the entire job to stop.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread - Exception tossed: XML parsing error: Character reference "&#xD83D" is an invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64; Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
>         ... 4 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
