pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Removing ALMOST all text from a pdf
Date Sun, 02 Dec 2018 09:16:22 GMT
Hi,

No, there isn't. You'd have to look at the logic in
PDFStreamEngine.showText that converts the raw string data into readable
strings. The result also depends on the current font.

The other problem is that a word is often split across several tokens.
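
To illustrate the split-token problem, here is a minimal sketch in plain Java. It is not PDFBox code. It assumes the operand of a TJ operator has already been decoded into Unicode fragments (which in PDFBox would need the current font, as noted above) and models that operand as a list in which strings are text fragments and numbers are positioning adjustments. The class name, the helper signatures, and the dash pattern are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the decoded operand of a TJ operator.
// Strings are text fragments; Numbers are glyph positioning adjustments.
class TjOperandSketch {
    // Assumed pattern: a run of two or more dashes.
    private static final Pattern DASHES = Pattern.compile("-{2,}");

    // Concatenates the text fragments and ignores the numeric adjustments.
    static String joinFragments(List<?> operands) {
        StringBuilder sb = new StringBuilder();
        for (Object o : operands) {
            if (o instanceof String) {
                sb.append((String) o);
            }
        }
        return sb.toString();
    }

    // Tests the joined text, not the individual fragments.
    static boolean matchesMyString(List<?> operands) {
        return DASHES.matcher(joinFragments(operands)).find();
    }

    public static void main(String[] args) {
        // A dash run split over three fragments with kerning in between.
        List<Object> tj = Arrays.<Object>asList("-", -20, "-", -20, "-");
        for (Object o : tj) {
            if (o instanceof String) {
                // Each fragment on its own is a single dash.
                System.out.println("fragment matches: "
                        + DASHES.matcher((String) o).find());
            }
        }
        System.out.println("joined matches: " + matchesMyString(tj));
    }
}
```

No single fragment satisfies the pattern here, while the joined text does, which is why checking one token at a time is not enough. Text that is split across separate Tj or TJ operators would need the same joining one level up, and this sketch does not attempt that.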

See the section "Why was the ReplaceText example removed?" in the
migration guide:
https://pdfbox.apache.org/2.0/migration.html

Tilman

On 02.12.2018 at 00:03, Nick Westerly wrote:
> I'm using the method here to remove text from a document:
>
> http://www.docjar.com/html/api/org/apache/pdfbox/examples/util/RemoveAllText.java.html
>
> And then rendering the page to an image.
>
> I'd like to do exactly as I'm doing, except leave certain pieces of text if
> they match a regex pattern (I'm looking for sequences of dashes).
>
> For this part of the parsing, I'd like to implement a method that checks
> the textual representations of the prevToken, and only removes it if it
> doesn't match my string. Are there any helper methods to get the text here
> given an element like this (possibly in PDFTextStripper or otherwise)? Or
> do I have to manually parse the text?
>
> for (Object token : tokens) {
>      if (token instanceof Operator) {
>          Operator op = (Operator) token;
>          if (op.getName().equals("TJ") || op.getName().equals("Tj")) {
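>              // inspect the one argument to this operator
>              Object prevToken = newTokens.get(newTokens.size() - 1);
>              if (!matchesMyString(prevToken)) {
>                  // no match: drop the argument and skip the operator
>                  newTokens.remove(newTokens.size() - 1);
>                  continue;
>              }
>              // match: fall through so the operator is kept with its argument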
>          }
>      }
>      newTokens.add(token);
> }
>
> Thanks
>
> Nick
>

