pdfbox-users mailing list archives

From Olaf Drümmer <olafl...@callassoftware.com>
Subject Re: 2 questions
Date Thu, 06 Mar 2014 20:30:18 GMT
You could use each character's x and y position and rotation, together with its size, to determine
whether two given characters are close to each other and on the same line.

BT / ET is not at all guaranteed to give you strings as perceived by a human.
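
As a rough illustration of the position-based idea, here is a minimal sketch in plain Python. It is not tied to any PDF library: the `Glyph` record, the functions `same_run` and `group_glyphs`, and the tolerances `gap_factor` and `line_factor` are all hypothetical names and values chosen for this example. It also assumes the glyphs arrive in reading order; if the producer emits them out of order, they would need to be sorted first.

```python
import math
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical per-character record, filled from whatever the PDF library reports."""
    char: str
    x: float               # start of the glyph on the page
    y: float               # baseline of the glyph
    width: float           # advance width in page units
    size: float            # font size in page units
    rotation: float = 0.0  # text direction in degrees


def _along_across(g, degrees):
    """Project the glyph origin onto the text direction and onto its normal."""
    t = math.radians(degrees)
    along = g.x * math.cos(t) + g.y * math.sin(t)
    across = -g.x * math.sin(t) + g.y * math.cos(t)
    return along, across


def same_run(prev, cur, gap_factor=0.3, line_factor=0.3):
    """True if cur plausibly continues the string that prev belongs to.

    The tolerances are fractions of the font size and are guesses;
    they would need tuning against real documents.
    """
    if abs(prev.rotation - cur.rotation) > 1.0:
        return False  # different text direction
    size = max(prev.size, cur.size)
    p_along, p_across = _along_across(prev, prev.rotation)
    c_along, c_across = _along_across(cur, prev.rotation)
    if abs(c_across - p_across) > line_factor * size:
        return False  # not on the same line
    # Distance from the end of prev to the start of cur along the line.
    gap = c_along - (p_along + prev.width)
    return -gap_factor * size <= gap <= gap_factor * size


def group_glyphs(glyphs, **tolerances):
    """Merge consecutive glyphs into strings wherever same_run() holds."""
    runs, current = [], []
    for g in glyphs:
        if current and not same_run(current[-1], g, **tolerances):
            runs.append(current)
            current = []
        current.append(g)
    if current:
        runs.append(current)
    return ["".join(g.char for g in run) for run in runs]
```

Once the strings are rebuilt this way, the references could be picked out by their prefix, for example with a pattern such as `I-[A-Z0-9-]+`. That pattern is only a guess at the reference format, since the original strings have no termination character.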

Olaf


On 6 Mar 2014, at 21:06, HQS <hqsoftwares@gmail.com> wrote:

> Well, thank you both for the quick replies.
> 
> The PDFs are generated by Autodesk Inventor (even the latest version produces that kind of output).
> 
> It is for one of my clients who wants an automatic transformation
> of some specific strings in the PDF into a clickable link.
> 
> My problem is very simple: with such a structure I have no way to know where the string ends.
> 
> As a matter of fact, all the references to be transformed are prefixed
> with ‘I-’ but there is no termination character, for instance "I-HOIST-042".
> Given that in the PDF I, -, H, O, (etc.), 2 are separate characters, I cannot rebuild the original string.
> 
> I was hoping that there would be a single block of text (BT … ET) but, as I mentioned, each character is put in its own block...
> 
> Regards,
> 
> 
> On 6 March 2014, at 18:57, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
> 
>> Hi Julien,
>> 
>> for 1) that’s possible and supported - how was the document generated? DTP application?
>> for 2) PDFBox doesn’t enforce a PDF version. In general it supports all PDF files. It doesn’t have full coverage of every feature defined in each PDF version, but coverage should be reasonable. There is no documentation on coverage yet, so I can’t guarantee that a specific feature is supported. Is there something special you are looking for?
>> 
>> BR
>> Maruan Sahyoun
>> 
>> On 06.03.2014, at 18:39, HQS <hqsoftwares@gmail.com> wrote:
>> 
>>> Hello all,
>>> 
>>> 1.
>>> Have you ever seen PDFs with this kind of (pseudo) structure:
>>> 
>>> BT
>>> <character>
>>> Tj
>>> ET
>>> 
>>> ?
>>> 
>>> That is, the strings are split into characters and there is one block of text per character?
>>> It seems to be ill-formed, doesn't it?
>>> 
>>> 2. As a reminder of my first mail: which PDF versions does the library comply with? 1.3 to 1.7?
>>> 
>>> 
>>> Thanks and regards
>>> 
>>> Julien
>>> 
>> 
> 

