pdfbox-users mailing list archives

From jorgeeflorez <jorgeeduardoflo...@gmail.com>
Subject Re: Extracting page "correctly"
Date Tue, 06 Nov 2018 21:30:25 GMT
Thanks a lot Tilman for your help.

It seems to me that text extraction from a page could be improved (I used
PDFBox 2.0.11). Ideally, one could invoke a single method and get the text
of the page, just as you would if you selected the text on the page in
Adobe Reader.

Looking at the code of LegacyPDFStreamEngine, the ancestor of
PDFTextStripper, I found the expression "THIS CODE IS DELIBERATELY
INCORRECT" on several occasions (I don't know whether this affects what I am
trying to do). Anyway, I made a subclass of PDFStreamEngine and tried to get
the text of the page (I am not familiar with the PDF specification,
operators, fonts and all that stuff). I just took some code from the
examples that I think I understood and added a couple of lines.

I could extract the text of the file I used for testing, regardless of the
page rotation. I also used the PDF file from PDFBOX-4368, and it seems the
text was extracted correctly. In a third file I used, the text was
extracted, but with no spaces between words (I guess spaces were not stored
in the PDF).

I attached the test files and the class I created. I know it doesn't cover
all the cases, but maybe it can be helpful.

By the way, text extraction was part of a bigger problem. I needed the text
of the page, and I also needed to group the text into words and store the
coordinates (x, y, width, height) of each word. I could do the grouping part
(more or less), but the first part was giving me trouble :)
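
To make the grouping step concrete, here is a minimal sketch of one way to
do it. It is not the class I attached. It assumes the glyphs arrive in
reading order, are not rotated, and already carry their own boxes; the names
WordGrouper, Box and gapFactor are made up for the illustration, and nothing
here calls PDFBox. Glyphs on the same line are merged while the horizontal
gap stays below a fraction of the glyph height, which is also a way to infer
word breaks when the PDF stores no space characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordGrouper {
    /** A piece of text with its bounding box (hypothetical helper type). */
    public static final class Box {
        public final String text;
        public final double x, y, width, height;

        public Box(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Groups glyphs (assumed to be in reading order) into words. */
    public static List<Box> groupIntoWords(List<Box> glyphs, double gapFactor) {
        List<Box> words = new ArrayList<>();
        Box current = null;
        for (Box g : glyphs) {
            if (g.text.trim().isEmpty()) {
                // An explicit whitespace glyph always ends the current word.
                if (current != null) {
                    words.add(current);
                    current = null;
                }
                continue;
            }
            if (current != null && sameLine(current, g)
                    && Math.abs(gap(current, g)) <= gapFactor * g.height) {
                current = merge(current, g);
            } else {
                if (current != null) {
                    words.add(current);
                }
                current = g;
            }
        }
        if (current != null) {
            words.add(current);
        }
        return words;
    }

    // Two boxes share a line if their y origins differ by less than half a glyph height.
    private static boolean sameLine(Box a, Box b) {
        return Math.abs(a.y - b.y) < 0.5 * Math.max(a.height, b.height);
    }

    // Horizontal distance from the right edge of a to the left edge of b.
    private static double gap(Box a, Box b) {
        return b.x - (a.x + a.width);
    }

    // Union of the two boxes, with the texts concatenated.
    private static Box merge(Box a, Box b) {
        double left = Math.min(a.x, b.x);
        double bottom = Math.min(a.y, b.y);
        double right = Math.max(a.x + a.width, b.x + b.width);
        double top = Math.max(a.y + a.height, b.y + b.height);
        return new Box(a.text + b.text, left, bottom, right - left, top - bottom);
    }

    public static void main(String[] args) {
        List<Box> glyphs = Arrays.asList(
                new Box("H", 0, 0, 5, 10), new Box("i", 5, 0, 3, 10),
                new Box("y", 12, 0, 5, 10), new Box("o", 17, 0, 5, 10));
        for (Box w : groupIntoWords(glyphs, 0.3)) {
            System.out.println(w.text + " " + w.x + " " + w.y + " " + w.width + " " + w.height);
        }
    }
}
```

The gap threshold is the fragile part: 0.3 of the glyph height is only a
guess and would probably need tuning per font and per document.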

Best Regards.
Jorge Eduardo Flórez

> I've been thinking about similar strategies for the same problem for
> some time but never worked on it.
> So yes, we could try all 4 rotations and then see which extract makes
> more sense.
> Another idea that I just came up with: take the
> DrawPrintTextLocations.java example from the source code download, then
> find this line
> AffineTransform at = text.getTextMatrix().createAffineTransform();
> below that, add this line:
> System.out.println("Angle: " + Math.toDegrees(Math.atan2(at.getShearY(),
> at.getScaleY())));
> Then look at the output....
> This gets the rotation angle, which will hopefully be one of 0, 90, 180,
> 270.
> Now run text extraction by preparing each page with
> page.setRotation(page.getRotation()-angle);
> However this won't work with fine rotations, e.g. the file from
> PDFBOX-4368.
> That would need something different, e.g. collecting all rotations, and
> then somehow run a filtered extract for each one.
> Tilman
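
The angle computation Tilman describes can be tried on its own with a plain
java.awt.geom.AffineTransform, without PDFBox. The sketch below is only an
illustration: the class TextAngle, the method names and the tolerance
parameter are assumptions, not part of PDFBox. It normalizes the angle to
[0, 360) and then applies the page.getRotation()-angle idea, but only when
the angle is close to a multiple of 90 degrees, since a page rotation has to
be a multiple of 90. For fine rotations such as the file from PDFBOX-4368 it
returns an empty result instead of guessing.

```java
import java.awt.geom.AffineTransform;
import java.util.OptionalInt;

public class TextAngle {
    /** Rotation angle in degrees, normalized to [0, 360), using the atan2 formula quoted above. */
    public static double angleDegrees(AffineTransform at) {
        double deg = Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY()));
        return (deg % 360 + 360) % 360;
    }

    /**
     * Page rotation that would compensate the given text angle, i.e.
     * pageRotation - angle, normalized to 0, 90, 180 or 270.
     * Empty if the angle is further than toleranceDeg from a multiple of 90.
     */
    public static OptionalInt correctedRotation(int pageRotation, double angleDeg,
                                                double toleranceDeg) {
        long nearest = Math.round(angleDeg / 90.0) * 90;
        if (Math.abs(angleDeg - nearest) > toleranceDeg) {
            return OptionalInt.empty(); // fine rotation, needs a different strategy
        }
        int corrected = (int) (((pageRotation - nearest) % 360 + 360) % 360);
        return OptionalInt.of(corrected);
    }

    public static void main(String[] args) {
        // Stand-in for text.getTextMatrix().createAffineTransform() in the example.
        AffineTransform at = AffineTransform.getRotateInstance(Math.toRadians(90));
        double angle = angleDegrees(at);
        System.out.println("Angle: " + angle);
        System.out.println("Corrected rotation: " + correctedRotation(0, angle, 1.0));
    }
}
```

In a real extraction the transform would come from each TextPosition, and
the angles found on a page could be collected first to decide whether a
single corrected rotation is enough.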
