pdfbox-users mailing list archives

From "Hesham G." <heshamgne...@gmail.com>
Subject Re: Spaces are ignored when reading a PDF file
Date Thu, 17 Mar 2016 20:19:26 GMT
John ,

I have looked at the PrintTextLocations.java example and tested it on the
"With due" phrase in my sample book, using this code:
System.out.println( "String[" + text.getCharacter() + ": " + text.getXDirAdj() + "," +
        text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" + text.getXScale() +
        " height=" + text.getHeightDir() + " space=" + text.getWidthOfSpace() +
        " width=" + text.getWidthDirAdj() + "]" );
And here are the results:
String[W: 102.88399,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=11.9552]
String[i: 114.18165,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=3.4789658]
String[t: 117.660614,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=3.8973923]
String[h: 121.55801,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=6.957924]
String[d: 133.09477,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=7.3046265]
String[u: 140.3994,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=7.2089844]
String[e: 147.60838,169.591 fs=11.9552 xscale=11.9552 height=7.328538 space=2.9888 width=5.7265472]
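
As a rough cross-check of these numbers, the gap between the end of one glyph and the start of the next can be computed as next x minus (x + width) and compared with the reported space width. This is only a minimal sketch of the manual heuristic discussed further down in the thread, not PDFBox's own algorithm; the class name GapSpaceDemo and the 0.5 threshold are assumptions made for illustration.

```java
// Hypothetical illustration class; not part of PDFBox.
final class GapSpaceDemo {
    // Values copied from the output above (character, x position, width).
    static final String[] CHARS = {"W", "i", "t", "h", "d", "u", "e"};
    static final double[] X = {102.88399, 114.18165, 117.660614, 121.55801,
                               133.09477, 140.3994, 147.60838};
    static final double[] WIDTH = {11.9552, 3.4789658, 3.8973923, 6.957924,
                                   7.3046265, 7.2089844, 5.7265472};
    static final double SPACE_WIDTH = 2.9888;

    // Assumed threshold: a gap wider than half a space counts as a word break.
    static final double SPACE_FRACTION = 0.5;

    static String inferText(String[] chars, double[] x, double[] width, double spaceWidth) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            if (i > 0) {
                // Distance from the right edge of the previous glyph to this glyph.
                double gap = x[i] - (x[i - 1] + width[i - 1]);
                if (gap > SPACE_FRACTION * spaceWidth) {
                    out.append(' ');
                }
            }
            out.append(chars[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(inferText(CHARS, X, WIDTH, SPACE_WIDTH)); // prints "With due"
    }
}
```

With these values the gap between "h" and "d" is about 4.58, which is larger than the space width of 2.9888, while every other gap is close to zero or slightly negative.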

So which method do you mean? getXDirAdj()?


Best regards ,
Hesham

------------------------------------------------------------------------
Included message :

I’m rather confused by this thread. Inferring spaces is one of the main features of
PDFTextStripper. I’m not sure why anyone is suggesting processing the text manually; there’s
no need to do that. We do that already!

Looking at the original code the problem is right here:

> public class PDFTextStripperProcessor extends PDFTextStripper {
>    @Override
>    public void processTextPosition( TextPosition text )  {
>        System.out.println( text.getCharacter() );
>    }
> }

The processTextPosition method is used to pass an unprocessed TextPosition *in* to PDFTextStripper,
but this override prevents that from happening, and is just printing the unprocessed token
before PDFTextStripper has had a chance to do its job, such as inferring the missing spaces.

You should follow our PrintTextLocations.java example which shows you how to get the processed
TextPositions from PDFTextStripper. It’s really easy to do.
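
The effect of that override can be sketched with a toy model. ToyStripper, processGlyph and the gapBefore flag below are hypothetical stand-ins, not PDFBox classes or methods; the sketch only assumes that the base class buffers its input in the hook method and post-processes it later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these types exist in PDFBox.
final class OverrideDemo {
    /** Toy stand-in for a text stripper: buffers glyphs, then builds the text. */
    static class ToyStripper {
        private final List<String> buffer = new ArrayList<>();

        // Input hook: the parser passes each raw glyph *in* to the stripper here.
        protected void processGlyph(String glyph, boolean gapBefore) {
            if (gapBefore) {
                buffer.add(" "); // the "processing" step: infer a missing space
            }
            buffer.add(glyph);
        }

        String getText() {
            return String.join("", buffer);
        }
    }

    /** Mirrors the problematic override: the glyph never reaches the base class. */
    static class Swallowing extends ToyStripper {
        @Override
        protected void processGlyph(String glyph, boolean gapBefore) {
            // Only looks at the raw glyph; no call to super.
        }
    }

    /** Same hook, but the glyph is forwarded so the base class can do its job. */
    static class Forwarding extends ToyStripper {
        @Override
        protected void processGlyph(String glyph, boolean gapBefore) {
            super.processGlyph(glyph, gapBefore);
        }
    }

    static String run(ToyStripper stripper) {
        String[] glyphs = {"W", "i", "t", "h", "d", "u", "e"};
        for (int i = 0; i < glyphs.length; i++) {
            stripper.processGlyph(glyphs[i], i == 4); // large gap before "d"
        }
        return stripper.getText();
    }

    public static void main(String[] args) {
        System.out.println("[" + run(new Swallowing()) + "]"); // prints "[]"
        System.out.println("[" + run(new Forwarding()) + "]"); // prints "[With due]"
    }
}
```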

— John

> On 17 Mar 2016, at 04:44, Hesham G. <heshamgneady@gmail.com> wrote:
> 
> Andreas,
> 
> You're absolutely right. I am testing it now, but it seems very complicated. I hope there
> might be another, easier solution.
> 
> 
> Best regards ,
> Hesham
> 
> ------------------------------------------------------------------------
> Included message :
> 
>> "Hesham G." <heshamgneady@gmail.com> wrote on 17 March 2016 at 11:20:
>> 
>> 
>> Andreas,
>> 
>> That is very helpful.
>> 
>> I can get the x location of each character using TextPosition.getX(), e.g.:
>> W: 102.88399
>> i: 114.18165
>> t: 117.660614
>> h: 121.55801
>> d: 133.09477
>> u: 140.3994
>> e: 147.60838
>> 
>> So to detect the space between the two words "With" and "due", should I
>> subtract the X of the last letter (h) from the X of the first letter (d),
>> and treat the gap as a space if it is larger than normal? I think this
>> kind of detection might be risky, or what?
> That's the short story. Deciding what is normal can be quite tricky. You have
> to take the following facts into account:
> 
> - different fonts have different widths (important if the font before the space
> isn't the same as the font after the space)
> - you have to take scaling, and sometimes rotation, into account
> - the "space" between characters may vary if the text is justified
> 
> There are certainly other details that may be important as well, so
> you end up with a more or less heuristic approach.
> 
> BR
> Andreas
> 
>> Best regards ,
>> Hesham
>> 
>> ------------------------------------------------------------------------
>> Included message :
>> 
>> Hi,
>> 
>> > Frank van der Hulst <drifter.frank@gmail.com> wrote on 17 March 2016 at 08:34:
>> >
>> >
>> > Spaces don't exist as characters in PDFs. To identify spaces, you have to
>> > compare the X coordinates of adjacent characters against their widths.
>> That's not correct. Spaces exist, but in most cases PDF engines omit them and
>> replace them with split text that is positioned appropriately.
>> 
>> BTW, LaTeX uses the same strategy. Here is an excerpt from your PDF:
>> 
>>   [ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 (Article)
>>   -384 (\(219\),) -416 (the) -384 (competent) -383 (authority) -383 (has) -384
>>   (the) -383 (right) ] TJ
>> 
>> The text is between the parentheses, and the numbers are used for horizontal
>> positioning.
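
To illustrate how such an array encodes word gaps: in a TJ array each number is in thousandths of a unit of text space and is subtracted from the current horizontal position, so a large negative number moves the next string to the right. The sketch below reads the excerpt and inserts a space for every shift beyond a threshold. TjArrayDemo and the threshold of 200 are assumptions for illustration, and the strings are treated as plain text, which a real PDF font encoding does not guarantee.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of PDFBox.
final class TjArrayDemo {
    // Matches either a "(string)" operand (with backslash escapes) or a number.
    private static final Pattern TOKEN =
        Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)|(-?\\d+(?:\\.\\d+)?)");

    static String inferText(String tjArray, double thresholdUnits) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(tjArray);
        while (m.find()) {
            if (m.group(1) != null) {
                // String operand: drop the backslash from escapes such as \( and \).
                out.append(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                // Numeric operand: a negative value shifts the next string right.
                double adjustment = Double.parseDouble(m.group(2));
                if (-adjustment > thresholdUnits) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String excerpt = "[ (W) 55 (ith) -383 (due) -384 (r) 18 (egar) 18 (d) -383 (to) -383 "
            + "(Article) -384 (\\(219\\),) -416 (the) -384 (competent) -383 (authority) "
            + "-383 (has) -384 (the) -383 (right) ] TJ";
        // Assumed threshold: shifts beyond 200/1000 of the font size count as a space.
        System.out.println(inferText(excerpt, 200));
    }
}
```

Assuming a font size of 11.9552 and no extra horizontal scaling, a value of -383 corresponds to a shift of about 383 / 1000 × 11.9552 ≈ 4.58 units, while small values such as 55 and 18 are ordinary kerning inside a word.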
>> 
>> BR
>> Andreas
>> 
>> >
>> > On Thu, Mar 17, 2016 at 7:12 PM, Hesham G. <heshamgneady@gmail.com> wrote:
>> >
>> > > Hello ,
>> > >
>> > > I have a PDF file created using LaTeX. I am trying to read and print all
>> > > letters in that file using PDFBox, but when doing this all spaces in that
>> > > file are ignored. Here is the code I am using:
>> > > PDPage page = (PDPage)allPages.get( 0 );
>> > > PDStream contents = page.getContents();
>> > > if ( contents != null ) {
>> > >     PDFTextStripperProcessor pdfTextStripperProcessor = new PDFTextStripperProcessor();
>> > >     pdfTextStripperProcessor.processStream( page, page.findResources(), contents.getStream() );
>> > > }
>> > >
>> > > public class PDFTextStripperProcessor extends PDFTextStripper {
>> > >     @Override
>> > >     public void processTextPosition( TextPosition text )  {
>> > >         System.out.println( text.getCharacter() );
>> > >     }
>> > > }
>> > >
>> > > And you can check a one page file sample here to test it:
>> > >
>> > > https://dl.dropboxusercontent.com/u/10111483/downloads/pdfbox/pdf_latex_spaces_ignored.pdf
>> > >
>> > > What is the cause of this issue please?
>> > >
>> > >
>> > > Best regards ,
>> > > Hesham
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>> 
>> 
> 

