pdfbox-dev mailing list archives

From "Kai Keggenhoff (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PDFBOX-4345) FDFAnnotation.richContentsToString
Date Wed, 17 Oct 2018 08:19:00 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kai Keggenhoff updated PDFBOX-4345:
-----------------------------------
    Description: 
The method FDFAnnotation.richContentsToString does not evaluate text nodes that have siblings
in the XML, which can lead to missing text when you parse XFDF data and add the annotations
to a PDF.

Example: parsing an XFDF string containing

<p>Text A <span style="text-decoration:word;">Text B</span> Text C</p>

and adding the annotation will display only "+Text B+".
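
The selection behaviour behind this can be reproduced outside PDFBox. The following is a minimal standalone sketch, not the PDFBox code itself: the class and method names (XPathStarDemo, textSelectedByStar) are made up for illustration, and it only assumes what the report states, namely that the children are visited with the XPath "*" expression. Since "*" matches element children only, the text nodes next to the "span" are never selected.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of PDFBox.
final class XPathStarDemo {

    /** Concatenates the text of the nodes that XPath "*" selects below the root element. */
    static String textSelectedByStar(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            // "*" matches element children only; sibling text nodes are not selected.
            NodeList selected = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("*", root, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < selected.getLength(); i++) {
                sb.append(selected.item(i).getTextContent());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or evaluate the sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>Text A <span style=\"text-decoration:word;\">Text B</span> Text C</p>";
        // Only the content of the span is reached; "Text A " and " Text C" are skipped.
        System.out.println(textSelectedByStar(xml)); // prints "Text B"
    }
}
```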

I've included a code sample (MergeTest.java) which generates two PDFs.
For one PDF, the paragraph contains only spans with text nodes as their only children, and
all the text is included. For the other PDF, the paragraph has mixed text nodes and elements
as children, and here the content from the text siblings of the "span" is missing.

I propose the following fix:

Instead of traversing the children of an element with the XPath "*" expression, simply iterate
the children obtained from Node.getChildNodes(), process Text and CDATASection nodes directly
and call richContentsToString for any elements.

(source : FDFAnnotation_new.java, diff to 2.0.12 : FDFAnnotation_diff.txt)
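
The traversal described above could look roughly like the sketch below. It is not the attached FDFAnnotation_new.java: the class and method names (ChildNodeTraversalSketch, collectText) are hypothetical, and the sketch only concatenates text, leaving out anything else the real richContentsToString does with the elements it visits.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of the proposed traversal, not the attached patch.
final class ChildNodeTraversalSketch {

    /** Walks all child nodes in document order and collects their text. */
    static String collectText(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:
                case Node.CDATA_SECTION_NODE:
                    // Text and CDATASection nodes are processed directly.
                    sb.append(child.getNodeValue());
                    break;
                case Node.ELEMENT_NODE:
                    // Elements are handled by recursing into their children.
                    sb.append(collectText(child));
                    break;
                default:
                    // Comments, processing instructions etc. are ignored in this sketch.
                    break;
            }
        }
        return sb.toString();
    }

    /** Convenience wrapper: parses the XML string and collects the text below its root element. */
    static String collectText(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return collectText(root);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>Text A <span style=\"text-decoration:word;\">Text B</span> Text C</p>";
        System.out.println(collectText(xml)); // prints "Text A Text B Text C"
    }
}
```

Because Node.getChildNodes() returns every child in document order, the text before and after the "span" is picked up together with the span's own content.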

Note: my first attempt at a fix was to replace the XPath "*" expression with "node()", but
for some reason, when I used this on a test case of

<p><![CDATA[A]]> B <span>C</span> D</p>

I would only obtain a NodeList containing the CDATASection, the "span" element and the final
text node, but not the text node containing "B".
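
For comparison, the sketch below lists what Node.getChildNodes() returns for the same test case. The class and method names (ChildNodeTypesDemo, childNodeTypes) are hypothetical, and the parser is explicitly set to non-coalescing, which is the JAXP default. A possible explanation for the "node()" result, not verified against PDFBox or this report, is that the XPath 1.0 data model treats a CDATA section and adjacent character data as a single text node, so a DOM-backed XPath engine may return only the first of the underlying DOM nodes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of PDFBox.
final class ChildNodeTypesDemo {

    /** Returns the node types of the root element's children, in document order. */
    static String childNodeTypes(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Assumption made explicit: CDATA sections stay separate DOM nodes.
            factory.setCoalescing(false);
            Node root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            NodeList children = root.getChildNodes();
            List<String> types = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                switch (children.item(i).getNodeType()) {
                    case Node.CDATA_SECTION_NODE: types.add("CDATASection"); break;
                    case Node.TEXT_NODE:          types.add("Text");         break;
                    case Node.ELEMENT_NODE:       types.add("Element");      break;
                    default:                      types.add("Other");        break;
                }
            }
            return String.join(", ", types);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        // The text node " B " shows up as its own child in the DOM.
        System.out.println(childNodeTypes("<p><![CDATA[A]]> B <span>C</span> D</p>"));
        // prints "CDATASection, Text, Element, Text"
    }
}
```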

  was:
The method FDFAnnotation.richContentsToString does not evaluate text nodes with siblings in
the XML which can lead to missing text when you parse XFDF data and add the annotations to
a PDF.

Example: parsing an XFDF string containing

<p>Text A <span style="text-decoration:word;">Text B</span> Text C</p>

and adding the annotation will display only "+Text B+".

I've included a code sample (MergeTest.java) which generates two PDFs.
For one PDF, the paragraph contains only spans with text nodes as their only children, and
all the text is included. For the other PDF, the paragraph has mixed text nodes and elements
as children, and here the content from the text siblings of the "span" is missing.

I propose the following fix:

Instead of traversing the children of an element with the XPath "*" expression, simply iterate
the children obtained from Node.getChildNodes(), process Text and CDATASection nodes directly
and call richContentsToString for any elements.

(source : FDFAnnotation_new.java, diff to 2.0.12 : FDFAnnotation_diff.txt) 

Note: my first attempt at a fix was to replace the XPath "*" expression with "node()", but
for some reason, when I used this on a test case of

<p><![CDATA[A]]> B <span>C</span> D</p>

I would only obtain a NodeList containing the CDATASection, the "span" element and the final
text node, but not the text node containing "B".


> FDFAnnotation.richContentsToString
> ----------------------------------
>
>                 Key: PDFBOX-4345
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4345
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 2.0.12
>            Reporter: Kai Keggenhoff
>            Priority: Major
>         Attachments: FDFAnnotation_diff.txt, FDFAnnotation_new.java, MergeTest.java
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

