pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: user a filter in a PDFStripper parsing
Date Fri, 09 Oct 2015 00:45:25 GMT
You want to subclass PDFTextStripper. It can do all the things you’ve mentioned.

— John

> On 7 Oct 2015, at 05:13, robyp7 . <robyp7@gmail.com> wrote:
> 
> Hi,
> 
> I have a question about PDFTextStripper.
> 
> I need to extract only certain keywords/text patterns while parsing each
> line of the PDF, separately for EACH PAGE (not for all pages at once).
> 
> 
> For example, given a PDF like:
> ABC 123
> xyg 4
> zz 2
> 
> I only need to obtain the text
> 
> ABC 123
> zz 2
> 
> and I also need the page position of each piece of extracted text.
> 
> So I was thinking of using a filter during parsing:
> 
> public class myFilter {
> 
>     // Return true to keep the line, false to drop it.
>     public boolean accept(String text) {
>         // ..
>     }
> }
> 
> During PDF parsing (line by line), PDFBox would call the accept method.
> 
> Is there an extension (i.e. a specialization/implementation) that does
> this, and how would I add one to PDFBox?
> 
> I am checking the source code but I can't find one. As far as I can see,
> the writeText method returns the text of all pages, not of each page
> separately.
> 
> If there isn't a solution, I will have to filter the entire text
> string and use page tags:
> 
> Page n 1
> ABC 123
> xyg 4
> zz 1
> 
> ..
> ..
> 
> Page n 2
> ABC 456
> xyhk
> zz 2
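
The fallback described at the end of the quoted message (filtering the whole extracted text and using page tags) could look roughly like the sketch below. It uses only the JDK and does not call PDFBox at all. The `Match` and `PageTaggedFilter` names are hypothetical, and the `Page n <number>` tag format is assumed from the example above. The same predicate could in principle be called from a PDFTextStripper subclass instead, as John suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One accepted line plus where it was found (hypothetical helper type). */
final class Match {
    final int page;       // page number read from the "Page n <number>" tag
    final int lineOnPage; // 1-based index among the non-empty lines of that page
    final String text;

    Match(int page, int lineOnPage, String text) {
        this.page = page;
        this.lineOnPage = lineOnPage;
        this.text = text;
    }

    @Override
    public String toString() {
        return "page " + page + ", line " + lineOnPage + ": " + text;
    }
}

/** Filters page-tagged text line by line (hypothetical, not part of PDFBox). */
final class PageTaggedFilter {
    // Assumed tag format, taken from the example in the message: "Page n 2".
    private static final Pattern PAGE_TAG = Pattern.compile("^Page n (\\d+)$");

    static List<Match> filter(String taggedText, Predicate<String> accept) {
        List<Match> matches = new ArrayList<>();
        int page = 0;        // stays 0 until the first page tag is seen
        int lineOnPage = 0;
        for (String raw : taggedText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;    // blank lines are skipped and not counted
            }
            Matcher tag = PAGE_TAG.matcher(line);
            if (tag.matches()) {
                // A new page starts: remember its number, restart the counter.
                page = Integer.parseInt(tag.group(1));
                lineOnPage = 0;
                continue;
            }
            lineOnPage++;
            if (accept.test(line)) {
                matches.add(new Match(page, lineOnPage, line));
            }
        }
        return matches;
    }
}

final class PageTaggedFilterDemo {
    public static void main(String[] args) {
        String text = "Page n 1\nABC 123\nxyg 4\nzz 1\n\nPage n 2\nABC 456\nxyhk\nzz 2\n";
        // Keep only lines that start with "ABC" or "zz".
        Predicate<String> accept = s -> s.startsWith("ABC") || s.startsWith("zz");
        for (Match m : PageTaggedFilter.filter(text, accept)) {
            System.out.println(m);
        }
    }
}
```

The predicate plays the role of the `accept` method from the question. Here the "position" is only the page number and the line index on that page; on-page coordinates are not available from plain text and would have to come from PDFBox itself.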

