pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: 2 questions
Date Fri, 07 Mar 2014 11:27:20 GMT
Hi Julien,

Composing words from individual characters may not be a 100% reliable method. Since you
know the pattern you are looking for, matching against it will certainly help.
Will it always be 100% accurate? Maybe not. What you could do is run the ExtractText
command line tool (http://pdfbox.apache.org/commandline/#extractText) or PDFTextStripper
on your PDF, look at the results, and check whether the words you are looking for
come out as whole words.
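
If you do end up rebuilding words yourself, the grouping step could look roughly
like the sketch below. It is plain Java and does not use PDFBox; GlyphGrouper,
Glyph, yTolerance and spaceGap are names made up for illustration, and the
positions and widths are assumed to come from whatever your text-processing
override receives. Characters whose Y value is within yTolerance of the previous
character (after sorting by Y) are treated as one line, each line is sorted by X,
and a space is inserted when the horizontal gap exceeds spaceGap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not a PDFBox class): rebuilds text lines from single glyphs. */
final class GlyphGrouper {

    /** One extracted character with its position and width (hypothetical type). */
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final float width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Groups glyphs into lines by Y (with a tolerance), orders each line by X,
     * and inserts a space where the gap between two glyphs exceeds spaceGap.
     * Lines are returned in ascending Y order.
     */
    static List<String> toLines(List<Glyph> glyphs, float yTolerance, float spaceGap) {
        // Work on a copy so the caller's list is left untouched.
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        // Start a new line whenever Y jumps by more than the tolerance
        // relative to the previous glyph in Y order.
        List<List<Glyph>> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : byY) {
            if (!current.isEmpty()
                    && g.y - current.get(current.size() - 1).y > yTolerance) {
                lines.add(current);
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }

        // Within each line, order by X and join, adding spaces at wide gaps.
        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : line) {
                if (prev != null && g.x - (prev.x + prev.width) > spaceGap) {
                    sb.append(' ');
                }
                sb.append(g.text);
                prev = g;
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

For example, glyphs A (x=10), B (x=15) and C (x=30), each 5 wide and all at a Y of
about 100, plus D at y=120, give the lines "AB C" and "D" with yTolerance 1.0 and
spaceGap 2.0. Both thresholds are guesses that need tuning per document, which is
one reason the result may not be 100% reliable.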

BR
Maruan Sahyoun

On 07.03.2014 at 12:16, Confidential Confidential <hqsoftwares@gmail.com> wrote:

> Sirs,
> 
> I had already thought about this graphical approach to reconstructing the
> words. I dropped it because I'm a bit sceptical about the reliability of
> such a method. I can't help thinking that it will not be a 100% sure
> method. I understand why CAD software would produce such output,
> though (thank you for "boustrophedonic", a new word that I didn't know,
> but it explains the result well).
> 
> Supposing that the characters appear in a totally arbitrary order,
> detecting that they're on the same line is more or less a piece of cake
> (unless I need to introduce a tolerance, which makes things more
> difficult), but grouping the characters according to their X position is
> not at all an easy task.
> 
> But that is not the real issue; my problem is more the fact that this method
> may not be 100% reliable. What do you think?
> 
> As for the technical part (overriding the processText), it's OK, thanks
> for the advice.
> 
> Best regards
> 
> Julien
> 
> 
> 
> 2014-03-06 18:39 GMT+01:00 HQS <hqsoftwares@gmail.com>:
> 
>> Hello all,
>> 
>> 1.
>> Have you ever seen PDFs with this kind of (pseudo) structure:
>> 
>> BT
>> <character>
>> Tj
>> ET
>> 
>> ?
>> 
>> In other words, the strings are split into characters, with one text block
>> per character?
>> It seems to be ill-formed, doesn't it?
>> 
>> 2. A reminder from my first mail: which PDF versions does the library
>> comply with? 1.3 to 1.7?
>> 
>> 
>> Thanks and regards
>> 
>> Julien
>> 
>> 

