manifoldcf-dev mailing list archives

From "Konstantin Avdeev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CONNECTORS-1325) Invalid XML character causing job to abort
Date Thu, 13 Oct 2016 09:51:20 GMT

    [ https://issues.apache.org/jira/browse/CONNECTORS-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571456#comment-15571456 ]

Konstantin Avdeev commented on CONNECTORS-1325:
-----------------------------------------------

An important update!

I tested the "bad" character again by inspecting the network traffic (http wire = DEBUG) to
see exactly what comes from SharePoint.

It turns out that this emoji character gets translated into a "wrong" format on the MCF side:
& # 128512; ---> & # xD83D;& # xDE00;
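
To illustrate the relationship between the two forms: decimal 128512 is the code point U+1F600, and its UTF-16 encoding consists of the two code units 0xD83D and 0xDE00. The following is only a minimal sketch; the class and method names are hypothetical and are not part of ManifoldCF. It shows how a component that writes one character reference per UTF-16 code unit would produce exactly the pair seen above.

```java
// Hypothetical demo class for illustration; not ManifoldCF code.
class EmojiSurrogates {

    /**
     * Writes each UTF-16 code unit of the given code point as its own
     * hexadecimal character reference. For code points above U+FFFF this
     * yields two references (a surrogate pair), which is not well-formed XML.
     */
    static String asUtf16References(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append("&#x")
              .append(Integer.toHexString(unit).toUpperCase())
              .append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 128512 == 0x1F600; prints &#xD83D;&#xDE00;
        System.out.println(asUtf16References(128512));
    }
}
```

Whether the conversion on the MCF side really happens this way is an assumption; the log below only shows that the value differs between the wire and the parsed response.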

{code}
DEBUG 2016-10-13 11:39:45,460 (Thread-2572) - http-outgoing-100 << "#' ows__ModerationStatus='0'
ows__Level='1' ows_Title='Task emoji &gt;&gt;&gt;&#128512;&lt;&lt;&lt;'
ows_UniqueId='5;#{8F6DF977-9814-4AA0-B7AE-E29838C508CF}' ows_owshiddenversion='3' ows_FSObjType='5;#0'
ows_PermMask='0x7fffffffffffffff' ows_FileRef='5;#sites/test-team/Lists/Main Task List/5_.000'
/>[\r][\n]"
...
DEBUG 2016-10-13 11:39:45,461 (Worker thread '45') - SharePoint: getListItems FileRef value
'sites/test-team/Lists/Main Task List/5_.000', xml response: '<ns1:listitems xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882"
xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset"
xmlns:z="#RowsetSchema" xmlns:ns1="http://schemas.microsoft.com/sharepoint/soap/">
<rs:data ItemCount="1">
   <z:row ows_Modified="2016-10-13 10:24:51" ows_Created="2016-10-12 17:30:55" ows_ID="5"
ows_GUID="{E583E8D8-52A7-4CD8-8A5F-6354D57D1E40}" ows_MetaInfo="5;#" ows__ModerationStatus="0"
ows__Level="1" ows_Title="Task emoji &gt;&gt;&gt;&#xD83D;&#xDE00;&lt;&lt;&lt;"
ows_UniqueId="5;#{8F6DF977-9814-4AA0-B7AE-E29838C508CF}" ows_owshiddenversion="3" ows_FSObjType="5;#0"
ows_PermMask="0x7fffffffffffffff" ows_FileRef="5;#sites/test-team/Lists/Main Task List/5_.000"/>
</rs:data>
</ns1:listitems>'
DEBUG 2016-10-13 11:39:45,494 (Worker thread '45') - SharePoint: Can't get version of '/Main
Task List///5_.000' because of bad XML characters(?)
{code}

And & #128512; is a valid XML 1.0 character reference!
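
For reference, XML 1.0 allows the characters #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so U+1F600 is allowed while the lone surrogates D83D and DE00 are not. Below is a rough sketch of that check, together with one possible workaround that merges an adjacent surrogate pair of hexadecimal references back into a single decimal reference before parsing. The class and method names are hypothetical, and this is not what the attached patches do as far as this message shows; it only illustrates the idea.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not ManifoldCF code.
class SurrogateReferenceRepair {

    // A high-surrogate reference (D800-DBFF) immediately followed by a
    // low-surrogate reference (DC00-DFFF), both in hexadecimal form.
    private static final Pattern PAIR = Pattern.compile(
        "&#x(D[89AB][0-9A-F]{2});&#x(D[C-F][0-9A-F]{2});",
        Pattern.CASE_INSENSITIVE);

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces each surrogate pair of references with one decimal reference. */
    static String mergeSurrogateReferences(String xml) {
        Matcher m = PAIR.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char high = (char) Integer.parseInt(m.group(1), 16);
            char low = (char) Integer.parseInt(m.group(2), 16);
            int codePoint = Character.toCodePoint(high, low);
            // The replacement contains no '$' or '\', so it is used literally.
            m.appendReplacement(out, "&#" + codePoint + ";");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String title = "Task emoji &gt;&gt;&gt;&#xD83D;&#xDE00;&lt;&lt;&lt;";
        // prints Task emoji &gt;&gt;&gt;&#128512;&lt;&lt;&lt;
        System.out.println(mergeSurrogateReferences(title));
    }
}
```

Unpaired surrogate references are left untouched by this sketch, so a parser would still reject them.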

Could you please take a look at the parser?
Thank you!

> Invalid XML character causing job to abort
> ------------------------------------------
>
>                 Key: CONNECTORS-1325
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1325
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: SharePoint connector
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Phil
>            Assignee: Karl Wright
>            Priority: Blocker
>             Fix For: ManifoldCF 2.5
>
>         Attachments: CONNECTORS-1325-2.patch, CONNECTORS-1325-3.patch, CONNECTORS-1325.patch, mcf-bad-ms-char.xml
>
>
> The following error is causing the Manifold job to abort, and subsequently the job not being able to finish.
> It would be good to have the crawler log this error, but not throw an exception which causes the entire job to stop.
> {code}
> ERROR 2016-06-21 19:01:54,562 (Worker thread '6') system.WorkerThread - Exception tossed: XML parsing error: Character reference "&#xD83D" is an invalid XML character.
> org.apache.manifoldcf.core.interfaces.ManifoldCFException: XML parsing error: Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:390)
>         at org.apache.manifoldcf.core.common.XMLDoc.<init>(XMLDoc.java:286)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getFieldValues(SPSProxyHelper.java:2039)
>         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:974)
>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 64; Character reference "&#xD83D" is an invalid XML character.
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>         at org.apache.manifoldcf.core.common.XMLDoc.init(XMLDoc.java:359)
>         ... 4 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
