From: "David KELLER (JIRA)"
To: dev@pdfbox.apache.org
Reply-To: dev@pdfbox.apache.org
Date: Thu, 2 Aug 2018 09:37:00 +0000 (UTC)
Subject: [jira] [Commented] (PDFBOX-4284) LibreOffice6 PDF Conversion broke PDFTextStripper result

    [ https://issues.apache.org/jira/browse/PDFBOX-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566543#comment-16566543 ]

David KELLER commented on PDFBOX-4284:
--------------------------------------

As you can see in [^libreoffice_6.0.txt], the extracted text reads "Le Maire, #siginatuire#", whereas if you open the PDF with Acrobat Reader you get "Le Maire, #siginature#", without the "i".

> LibreOffice6 PDF Conversion broke PDFTextStripper result
> ----------------------------------------------------------
>
>                 Key: PDFBOX-4284
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4284
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 3.0.0 PDFBox
>         Environment: Windows 10 and CentOS 7
>            Reporter: David KELLER
>            Priority: Major
>              Labels: features
>         Attachments: libreoffice_5.2.pdf, libreoffice_5.2.txt, libreoffice_6.0.pdf, libreoffice_6.0.txt, original-document.docx
>
>
> Here is the test program:
> {{import java.io.File;}}
> {{import org.apache.pdfbox.pdmodel.PDDocument;}}
> {{import org.apache.pdfbox.text.PDFTextStripper;}}
> {{}}
> {{public class ExtractTextPdfTest {}}
> {{    }}
> {{    public static void main(String[] args) throws Exception {}}
> {{        // #7272}}
> {{//        String documentIn = "c:\\data\\test\\libreoffice_5.2.pdf";}}
> {{        String documentIn = "c:\\data\\test\\libreoffice_6.0.pdf";}}
> {{        }}
> {{        try (PDDocument pdDocument = PDDocument.load(new File(documentIn))) {}}
> {{            PDFTextStripper stripper = new PDFTextStripper();}}
> {{            String content = stripper.getText(pdDocument);}}
> {{            System.out.println(content);}}
> {{        }}}
> {{        }}
> {{    }}}
> {{}}}
>
> 1/ Run PDFTextStripper on a Word document converted to PDF by LibreOffice 5.2.
> Result:
> {quote}Réf : #chrono# Le #date#
> Affaire suivie par :
> #recipient.salutation#
> #recipient.name#
> #recipient.streetNumber#
> #recipient.streetName#
> #recipient.zipCode#
> #recipient.locality#
> #object#
> #recipient.salutation#,
> Nous  avons  bien  reçu  votre  candidature  pour  le  poste  de………………………….  et  nous  vous
> remercions de l’intérêt que vous portez à notre administration.
> Afin d'examiner votre candidature de manière plus complète, nous souhaiterions vous rencontrer.
> Aussi, nous vous proposons un rendez-vous en nos locaux avec M ... , responsable du service de ... , le
> ... à ... heures.
> Nous vous prions d’agréer, #recipient.salutation#, l’expression de nos salutations distinguées.
> Le Maire,
> #signature#
> {quote}
>
> 2/ Run PDFTextStripper on the same Word document converted to PDF by LibreOffice 6.0.
>
> Result:
> {quote}Réf : Destinataire
> Affaire suiiiie aar : Adresse
> Code Postal
> Ville
> Paris, le 25/07/2018
> Madame, Moinsieuir
> Nous avons le plaisir de vous informer que suite à la Commission d’Attribution de Logement
> qui s’est tenue le xx/xx/xxxx, nous avons décidé de vous attribuer le logement situé au xx
> rue xxxxxxxxxxxxxxxxxxxx, 75 000 Paris.
> Les caractéristiuies de ce logemeint soint les suiiiaintes :
>  Suirface habitable :
>  Tyae de logemeint :
>  Garage/Parkiing :
>  Mointaint dui loyer :
>  Mointaint des charges :
>  Mointaint dui déaôt de garainte :
>  Date d’eintrée dains  les lieuix :
> Les s mointaints arécisés soint à déduiire, le cas échéaint, de l'aide aui logemeint (APL, AL) calcuilée et
> commuiiniiuiée aar iotre Caisse d'allocatoins familiales.
> Vouis  aiez  juisiui’aui  xx/xx/xx  aouir  inouis  siginifer  l’acceatatoin  de  ce  logemeint  aar  letre
> recommaindée aiec accuisé de réceatoin.
> Vouis ariaint d’agréer, Madame, Moinsieuir, l’exaressioin de mes saluitatoins distinguiées.
> Le Maire,
> #siginatuire#
> {quote}
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org
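
The comparison described in the comment (expected "#signature#" versus extracted "#siginatuire#") could be checked automatically rather than by eye. The following is only a minimal sketch under stated assumptions: `PlaceholderCheck` and `findMissing` are hypothetical names and not part of PDFBox, and the two input strings are simply the closing lines of the two results quoted in the issue. In practice the text would come from `PDFTextStripper.getText()` as in the test program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical helper (not part of PDFBox): reports which expected
 * template placeholders are absent from extracted text.
 */
final class PlaceholderCheck {

    /** Returns the expected placeholders that do not occur verbatim in the extracted text. */
    static List<String> findMissing(String extracted, List<String> expected) {
        List<String> missing = new ArrayList<>();
        for (String placeholder : expected) {
            if (!extracted.contains(placeholder)) {
                missing.add(placeholder);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Closing lines of the two extraction results quoted in the issue.
        String fromLibreOffice52 = "Le Maire,\n#signature#";
        String fromLibreOffice60 = "Le Maire,\n#siginatuire#";
        List<String> expected = Arrays.asList("#signature#");

        // Prints "5.2 missing: []"
        System.out.println("5.2 missing: " + findMissing(fromLibreOffice52, expected));
        // Prints "6.0 missing: [#signature#]"
        System.out.println("6.0 missing: " + findMissing(fromLibreOffice60, expected));
    }
}
```

A plain substring test is used on purpose: the point is to flag any placeholder that does not survive extraction exactly, which is what a template-filling step downstream would depend on.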