pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: How can I manipulate text in PDF'd by using PDFBox
Date Sat, 01 Sep 2012 11:24:23 GMT

On 01.09.2012 04:24, Mac P wrote:
> Hello Forum
> Is there any way to split a master PDF file consisting of many pages into separate
> pages based on the content or keywords in each page?
> Each page has the person's first and last name. I would like to grep the last name and
> write a script to separate each page, turning it into a new PDF file with the last name as
> part of the file name, instead of the sequential numbers (up to the total number of pages)
> at the end of each file name.
> I know PDFs are binary documents. Are there any tools to look up the last names and
> manipulate them that way?
Use PDFSplit [1] to split your PDF into single pages and ExtractText [2] to get
the string you're looking for. The first step should work out of the box; the
latter could be complicated depending on the fonts used, etc. Just give it a try.
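
As a rough sketch of the renaming step: once ExtractText has produced the text
of a single page, picking out the last name and building the new file name
might look like the Java below. It does not call PDFBox at all. The class and
method names, the "master" base name, and above all the assumed
"Name: First Last" line are hypothetical and need adapting to what the
extracted text of your pages really looks like.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastNameFileNamer {

    // ASSUMPTION: each page contains a line such as "Name: Jane Doe".
    // Adjust this pattern to the real layout of the extracted text.
    private static final Pattern NAME_LINE =
            Pattern.compile("(?m)^\\s*Name:\\s*(\\S+)\\s+(\\S+)\\s*$");

    // Returns the last name found in the page text, if any.
    static Optional<String> findLastName(String pageText) {
        Matcher m = NAME_LINE.matcher(pageText);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    // Builds "<baseName>-<lastName>.pdf", falling back to the page number
    // when no name line is found on the page.
    static String fileNameFor(String baseName, String pageText, int pageNumber) {
        String suffix = findLastName(pageText)
                // Replace anything that is not a letter, digit, '_' or '-'
                // so the result is safe to use in a file name.
                .map(name -> name.replaceAll("[^\\p{L}\\p{N}_-]", "_"))
                .orElse(String.valueOf(pageNumber));
        return baseName + "-" + suffix + ".pdf";
    }

    public static void main(String[] args) {
        // prints "master-Doe.pdf"
        System.out.println(fileNameFor("master", "Name: Jane Doe\nSome other text", 1));
        // prints "master-2.pdf"
        System.out.println(fileNameFor("master", "no name on this page", 2));
    }
}
```

Falling back to the page number keeps every page accounted for when the
extracted text does not contain the expected line, which can happen with
unusual fonts. Two people sharing a last name would produce the same file
name, so a real script would also need to handle collisions.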

> Thanks
> Mac

Andreas Lehmkühler

P.S.: Subscribe properly to the mailing list [3]; otherwise you won't
receive any replies.

[1] http://pdfbox.apache.org/commandlineutilities/PDFSplit.html
[2] http://pdfbox.apache.org/commandlineutilities/ExtractText.html
[3] http://pdfbox.apache.org/mail-lists.html
