pdfbox-users mailing list archives

From Kai Keggenhoff <keggenh...@conclude.com>
Subject Re: Merging freetext annotations with missing spans
Date Tue, 16 Oct 2018 08:47:21 GMT
Hello, 

I tracked this behaviour down to FDFAnnotation.richContentsToString.

This method ignores Text nodes that are siblings of Elements, so the rich contents of the
annotation lack those parts.

Since I have several examples of Adobe Acrobat Reader DC producing this structure, I consider
this a bug in PDFBox.
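
To illustrate what handling mixed content means here, the following is a minimal sketch that walks a node's children in document order and keeps Text nodes even when they sit next to Element siblings. It is not the PDFBox implementation and not a proposed patch: the class MixedContentText and its methods are hypothetical names, it uses only the JDK's DOM API, and it collects plain text only, without reproducing any markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of PDFBox.
class MixedContentText {

    // Appends all character data below 'node' in document order,
    // including Text nodes that have Element siblings.
    static void collectText(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                out.append(child.getNodeValue());
            } else if (type == Node.ELEMENT_NODE) {
                collectText(child, out); // descend into <span> and similar elements
            }
        }
    }

    // Parses an XML fragment and returns the text of its root element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String withoutSpans = "<p dir=\"ltr\">P1 "
                + "<span style=\"text-decoration:word;font-family:Helvetica\">P2</span> P3</p>";
        String withSpans = "<p dir=\"ltr\"><span style=\"font-family:Helvetica\">P1 </span>"
                + "<span style=\"text-decoration:word;font-family:Helvetica\">P2</span>"
                + "<span style=\"font-family:Helvetica\"> P3</span></p>";
        System.out.println(textOf(withoutSpans)); // prints "P1 P2 P3"
        System.out.println(textOf(withSpans));    // prints "P1 P2 P3"
    }
}
```

With this kind of traversal both variants of the paragraph yield the same text, which is the behaviour I would expect from the merge.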

Best regards,

Kai Keggenhoff

-----Original Message-----
From: Kai Keggenhoff <keggenhoff@conclude.com>
Sent: Monday, 24 September 2018 11:10
To: users@pdfbox.apache.org
Subject: Re: Merging freetext annotations with missing spans

Hello, 

In addition to my earlier email, here is sample code that produces two PDF files showing
the difference between freetext annotations containing

<p dir="ltr"><span style="font-family:Helvetica">P1 </span><span style="text-decoration:word;font-family:Helvetica">P2</span><span style="font-family:Helvetica"> P3</span></p>

in contrast to 

<p dir="ltr">P1 <span style="text-decoration:word;font-family:Helvetica">P2</span> P3</p>

The former produces the expected "P1 P2 P3", the latter shows only "P2".
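
The structural difference between the two paragraphs can be made visible with a small sketch that lists the direct children of the root element. This is only a diagnostic aid using the JDK's DOM parser; ChildNodeKinds and describeChildren are hypothetical names and nothing here calls PDFBox.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical diagnostic helper; not part of PDFBox.
class ChildNodeKinds {

    // Returns a comma-separated description of the root element's direct children.
    static String describeChildren(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder sb = new StringBuilder();
            for (Node c = doc.getDocumentElement().getFirstChild(); c != null; c = c.getNextSibling()) {
                if (sb.length() > 0) {
                    sb.append(", ");
                }
                if (c.getNodeType() == Node.ELEMENT_NODE) {
                    sb.append("Element <").append(c.getNodeName()).append('>');
                } else if (c.getNodeType() == Node.TEXT_NODE) {
                    sb.append("Text \"").append(c.getNodeValue()).append('"');
                } else {
                    sb.append("other");
                }
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Variant that loses text: bare Text nodes next to a <span>
        System.out.println(describeChildren(
                "<p dir=\"ltr\">P1 <span style=\"text-decoration:word\">P2</span> P3</p>"));
        // prints: Text "P1 ", Element <span>, Text " P3"

        // Variant that works: every piece of text is inside its own <span>
        System.out.println(describeChildren(
                "<p dir=\"ltr\"><span>P1 </span><span>P2</span><span> P3</span></p>"));
        // prints: Element <span>, Element <span>, Element <span>
    }
}
```

In the second paragraph, "P1 " and " P3" are Text nodes that are siblings of the span Element, and exactly those are the parts missing from the result.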

For my tests I used PDFBox 2.0.11.

Thanks in advance,

Kai Keggenhoff



package xfdfannotation;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.fdf.FDFAnnotation;
import org.apache.pdfbox.pdmodel.fdf.FDFDocument;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;

import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import java.io.StringReader;
import java.util.List;

public class MergeText {
	public static void main(String[] args) {

		// Bare text next to a <span>: after merging only "P2" is shown
		String xfdf_without_spans = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
			"<xfdf xmlns=\"http://ns.adobe.com/xfdf/\" xml:space=\"preserve\"" +
			"><annots" +
			"><freetext color=\"#FFFFFF\" creationdate=\"D:20180924102518+02'00'\" flags=\"print\" " +
			"date=\"D:20180924102537+02'00'\" page=\"0\" rect=\"17.382233,685.894287,121.675568,758.765869\" " +
			"subject=\"Textfeld\" title=\"keggenhoff\"" +
			"><contents-richtext" +
			"><body xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\" " +
			"xfa:APIVersion=\"Acrobat:18.11.0\" xfa:spec=\"2.0.2\" " +
			"style=\"font-size:12.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Helvetica,sans-serif;font-stretch:normal\"" +
			"><p dir=\"ltr\"" +
			">P1 <span style=\"text-decoration:word;font-family:Helvetica\"" +
			">P2</span" +
			"> P3</p" +
			"></body" +
			"></contents-richtext" +
			"><defaultappearance" +
			">0.898 0.1333 0.2157 rg /Helv 12 Tf</defaultappearance" +
			"><defaultstyle" +
			">font: Helvetica,sans-serif 12.0pt; text-align:left; color:#E52237 </defaultstyle" +
			"></freetext" +
			"></annots" +
			"><f href=\"/C/Users/KEGGEN~1/AppData/Local/Temp/demo.pdf\"" +
			"/><fields" +
			"><field name=\"submit\"" +
			"/></fields" +
			"><ids original=\"F285D06ECA30C5579E72B6B7AE07BC0B\" modified=\"1A190CB840919E279B93BF3D5D488C13\"" +
			"/></xfdf" +
			">";

		// All text wrapped in <span> elements: after merging "P1 P2 P3" is shown
		String xfdf_with_spans = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
			"<xfdf xmlns=\"http://ns.adobe.com/xfdf/\" xml:space=\"preserve\"" +
			"><annots" +
			"><freetext color=\"#FFFFFF\" creationdate=\"D:20180924102518+02'00'\" flags=\"print\" " +
			"date=\"D:20180924102537+02'00'\" page=\"0\" rect=\"17.382233,685.894287,121.675568,758.765869\" " +
			"subject=\"Textfeld\" title=\"keggenhoff\"" +
			"><contents-richtext" +
			"><body xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\" " +
			"xfa:APIVersion=\"Acrobat:18.11.0\" xfa:spec=\"2.0.2\" " +
			"style=\"font-size:12.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Helvetica,sans-serif;font-stretch:normal\"" +
			"><p dir=\"ltr\"" +
			"><span style=\"font-family:Helvetica\"" +
			">P1 </span" +
			"><span style=\"text-decoration:word;font-family:Helvetica\"" +
			">P2</span" +
			"><span style=\"font-family:Helvetica\"" +
			"> P3</span" +
			"></p" +
			"></body" +
			"></contents-richtext" +
			"><defaultappearance" +
			">0.898 0.1333 0.2157 rg /Helv 12 Tf</defaultappearance" +
			"><defaultstyle" +
			">font: Helvetica,sans-serif 12.0pt; text-align:left; color:#E52237 </defaultstyle" +
			"></freetext" +
			"></annots" +
			"><f href=\"/C/Users/KEGGEN~1/AppData/Local/Temp/demo.pdf\"" +
			"/><fields" +
			"><field name=\"submit\"" +
			"/></fields" +
			"><ids original=\"F285D06ECA30C5579E72B6B7AE07BC0B\" modified=\"1A190CB840919E279B93BF3D5D488C13\"" +
			"/></xfdf" +
			">";
		createPdf("demo_no_spans.pdf", xfdf_without_spans);
		createPdf("demo_with_spans.pdf", xfdf_with_spans);

	}

	private static void createPdf(String filename, String xfdf) {
		try {
			// Parse the XFDF string into a DOM tree
			org.w3c.dom.Document xfdf_doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
					.parse(new InputSource(new StringReader(xfdf)));

			// try-with-resources closes both documents, even if save() throws
			try (FDFDocument fdf_doc = new FDFDocument(xfdf_doc);
					PDDocument pdf = new PDDocument()) {
				PDPage page = new PDPage();

				// Convert each XFDF annotation into a PDF annotation on the page
				List<FDFAnnotation> xfdfAnnotations = fdf_doc.getCatalog().getFDF().getAnnotations();
				for (FDFAnnotation xfdfAnnotation : xfdfAnnotations) {
					PDAnnotation a = PDAnnotation.createAnnotation(xfdfAnnotation.getCOSObject());
					page.getAnnotations().add(a);
				}

				pdf.addPage(page);
				pdf.save(filename);
			}
		}
		catch (Exception e) {
			e.printStackTrace();
		}
	}
}




-----Original Message-----
From: Kai Keggenhoff <keggenhoff@conclude.com>
Sent: Thursday, 13 September 2018 13:44
To: users@pdfbox.apache.org
Subject: Merging freetext annotations with missing spans

Hello,

I'm working on an application which merges annotations from XFDF files into PDF files, and I
noticed some strange behaviour with certain types of text annotations.

It looks like text that is not contained in a span is ignored when merging.

One user uploaded this annotation (not the actual texts), created with an older Acrobat:

<freetext width="2.000000" color="#FFFFFF" creationdate="D:20180910162711+02'00'" flags="print"
date="D:20180911172716+02'00'" page="0" rect="1136.342529,3886.797363,1221.432617,4367.977539"
rotation="90" subject="Textfeld" title="username"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/"
xfa:APIVersion="Acrobat:11.0.15" xfa:spec="2.0.2" style="font-size:11.0pt;text-align:left;color:#0000FF;font-weight:bold;font-style:normal;font-family:Arial;font-stretch:normal"
><p dir="ltr"
>ABC <span style="text-decoration:underline"
>DEF</span
> GHI&#xD;</p
><p dir="ltr"
><span style="font-weight:normal"
>More text&#xD;</span
></p
><p dir="ltr"
><span style="font-weight:normal"
>More text</span
></p
><p dir="ltr"
><span style="font-weight:normal"
>More text</span
></p
></body
></contents-richtext
><defaultappearance
>0 0 1 rg /Arial,Bold 11 Tf</defaultappearance
><defaultstyle
>font: bold Arial 11.0pt; text-align:left; color:#0000FF </defaultstyle
></freetext
>

After merging, the texts "ABC" and "GHI" are gone - they are not displayed and not shown in
the comments area in Acrobat Reader.

When I tried to create a similar annotation using a current Acrobat Reader DC, I got:

<freetext color="#FFFFFF" creationdate="D:20180913132943+02'00'" flags="print" date="D:20180913132956+02'00'"
page="0" rect="181.799377,672.266907,326.595337,723.213623" subject="Textfeld" title="keggenhoff"
><contents-richtext
><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/"
xfa:APIVersion="Acrobat:18.11.0" xfa:spec="2.0.2" style="font-size:12.0pt;text-align:left;color:#FF0000;font-weight:normal;font-style:normal;font-family:Helvetica,sans-serif;font-stretch:normal"
><p dir="ltr"
><span style="font-family:Helvetica"
>ABC</span
><span style="text-decoration:word;font-family:Helvetica"
> DEF</span
><span style="font-family:Helvetica"
> GHI</span
></p
></body
></contents-richtext
><defaultappearance
>0.898 0.1333 0.2157 rg /Helv 12 Tf</defaultappearance
><defaultstyle
>font: Helvetica,sans-serif 12.0pt; text-align:left; color:#E52237 </defaultstyle
></freetext
>

When I merge this annotation with the PDF, the text is complete.
However, when I remove the span tags around ABC and GHI, both texts are again missing after
merging.
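
Going in the opposite direction, a possible workaround on the application side would be to wrap bare text in span elements before the XFDF is merged. The sketch below is only that, a sketch: SpanWrapper and its methods are hypothetical names, it uses the JDK's DOM and transformer APIs only, it assumes a parser that is not namespace-aware, and it only looks at direct children of p elements. I have not verified it against PDFBox; that the wrapped text is then picked up is an assumption based on the observation above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical workaround helper; not part of PDFBox.
class SpanWrapper {

    // Wraps each Text child of a <p> in a new <span>, but only in paragraphs
    // that mix Text and Element children. Text-only paragraphs are left alone.
    static void wrapBareText(Document doc) {
        NodeList paragraphs = doc.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Node p = paragraphs.item(i);
            // Snapshot the children first, because the tree is modified below
            List<Node> children = new ArrayList<>();
            boolean hasElement = false;
            for (Node c = p.getFirstChild(); c != null; c = c.getNextSibling()) {
                children.add(c);
                hasElement |= c.getNodeType() == Node.ELEMENT_NODE;
            }
            if (!hasElement) {
                continue;
            }
            for (Node c : children) {
                if (c.getNodeType() == Node.TEXT_NODE) {
                    Element span = doc.createElement("span");
                    p.replaceChild(span, c); // put the span where the text was
                    span.appendChild(c);     // move the text into the span
                }
            }
        }
    }

    // Convenience wrapper for strings: parse, wrap, serialize.
    static String wrap(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            wrapBareText(doc);
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrap("<p dir=\"ltr\">ABC <span style=\"text-decoration:underline\">DEF</span> GHI</p>"));
    }
}
```

In an application, wrapBareText would be called on the parsed XFDF document before it is passed on for merging. The new spans carry no style attribute, so any formatting would have to come from the enclosing body element.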

Now my question is whether the (ancient) Acrobat should have included span tags there or if
PDFBox should process the text that is not inside a span.

I tested this with PDFBox 2.0.6 and 2.0.11 and the behaviour was identical.

Thanks in advance,

Kai Keggenhoff

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org