pdfbox-users mailing list archives

From Paul Monday <paul.mon...@docsforce.com>
Subject Re: How to define regions in PDFTextStripperByArea?
Date Tue, 29 Apr 2014 13:00:46 GMT
It's not really PDFBox that mixed up the main content; it's a consequence of how basic text
extraction works.  You run into this quite often when interpreting PDF files.  I've been
playing with this all week, so I actually have some code.

There are two things you can try.  You could get the rectangle that the crop box defines and
have the text stripper attempt to sort by position.  Depending on how your headers and footers
were inserted, this may sort it out.  Here is where I did that on a per-page basis:

		// "pages" is the document's page list, e.g. from getAllPages() as in the second example
		for (PDPage page : pages) {
			// Use the whole crop box as a single region
			PDRectangle pdr = page.getCropBox();
			Rectangle rec = new Rectangle(
					Math.round(pdr.getLowerLeftX()),
					Math.round(pdr.getLowerLeftY()),
					Math.round(pdr.getWidth()),
					Math.round(pdr.getHeight()));
			System.out.println("Cropbox: " + rec);
			PDFTextStripperByArea stripper = new PDFTextStripperByArea();
			stripper.addRegion("cropbox", rec);
			// Order the text by its position on the page
			stripper.setSortByPosition(true);
			stripper.extractRegions(page);
			List<String> regions = stripper.getRegions();
			for (String region : regions) {
				String text = stripper.getTextForRegion(region);
				System.out.println(text);
			}
		}

This may sort your strings in the order you want.

Otherwise, refine the regions within your crop box.  Here I define three separate rectangles
using the dimensions of the crop box:

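		// Page size, hard-coded here (612 x 792 is US Letter in points)
		int width = 612;
		int height = 792;
		
		// h = header, t = totals, c = customer: x, y, width and height of each region
		int hX = 320, tX = 340, cX = 100;
		int hY = 0, tY = 580, cY = 200;
		int hW = width - hX, tW = width - tX, cW = 100;
		int hH = 80, tH = height - tY, cH = 60;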
		
		Rectangle header = new Rectangle();
		header.setBounds(hX, hY, hW, hH);
		Rectangle totals = new Rectangle();
		totals.setBounds(tX, tY, tW, tH);
		Rectangle customer = new Rectangle();
		customer.setBounds(cX, cY, cW, cH);
		
		PDFTextStripperByArea stripper = new PDFTextStripperByArea();
		stripper.addRegion("header", header);
		stripper.addRegion("totals", totals);
		stripper.addRegion("customer", customer);
		stripper.setSortByPosition(true);
		
		// pd is the already-loaded PDDocument; j is a zero-based page index
		int j = 0;
		List<PDPage> pages = pd.getDocumentCatalog().getAllPages();
		for (PDPage page : pages) {
			stripper.extractRegions(page);
			List<String> regions = stripper.getRegions();
			for (String region : regions) {
				String text = stripper.getTextForRegion(region);
				System.out.println("Region: " + region + " on Page " + j);
				System.out.println("\tText: \n" + text);
			}
			j++;
		}
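
Since the original question was about ignoring headers and footers, a single "body" region
may be enough.  Here is a minimal sketch using only java.awt.Rectangle.  The class and method
names are made up for illustration (they are not part of PDFBox), and the band heights of 80
and 72 points are assumptions you would need to tune for your own documents:

```java
import java.awt.Rectangle;

// Hypothetical helper, not part of PDFBox: builds one region that skips
// fixed-height header and footer bands. Coordinates use a top-left origin.
class BodyRegionSketch {

    static Rectangle bodyRegion(int pageWidth, int pageHeight,
                                int headerHeight, int footerHeight) {
        int bodyHeight = pageHeight - headerHeight - footerHeight;
        if (bodyHeight <= 0) {
            throw new IllegalArgumentException(
                    "header and footer leave no room for a body");
        }
        // The body starts right below the header band and spans the full width
        return new Rectangle(0, headerHeight, pageWidth, bodyHeight);
    }

    public static void main(String[] args) {
        // 612 x 792 as in the example above; 80 and 72 are assumed band heights
        Rectangle body = bodyRegion(612, 792, 80, 72);
        System.out.println("Body region: " + body);
    }
}
```

The resulting rectangle could then be passed to stripper.addRegion("body", body) in place of
the three regions above.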

Hope that helps.  Apologies if I made any mistakes, I'm new to extracting by region as well
:-)

Remember that region coordinates have X,Y = 0,0 at the top left corner of the page, with Y
increasing downward.
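
If you have a box measured the usual PDF way (origin at the bottom left), the flip to a
top-left origin is simple arithmetic.  This is only a sketch: the helper name is hypothetical,
and it assumes an unrotated page whose own origin is at 0,0:

```java
import java.awt.Rectangle;

// Hypothetical helper, not part of PDFBox: converts a box given with a
// bottom-left origin into a Rectangle with a top-left origin.
class TopLeftSketch {

    static Rectangle toTopLeft(float lowerLeftX, float lowerLeftY,
                               float width, float height, float pageHeight) {
        int x = Math.round(lowerLeftX);
        // The top edge of the box sits at lowerLeftY + height from the bottom,
        // so its distance from the top of the page is pageHeight minus that.
        int y = Math.round(pageHeight - (lowerLeftY + height));
        return new Rectangle(x, y, Math.round(width), Math.round(height));
    }

    public static void main(String[] args) {
        // A 272 x 212 box sitting on the bottom edge of a 792-point-high page
        // ends up with y = 580, matching the "totals" region above.
        Rectangle totals = toTopLeft(340f, 0f, 272f, 212f, 792f);
        System.out.println("Totals region: " + totals);
    }
}
```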

Paul
On Apr 29, 2014, at 12:38 AM, Qingchao Kong <kqingchao@gmail.com> wrote:

> Hi! I am using PDFBox to extract text from PDF files. One problem I am
> facing is: PDFBox mixed the main content up with the PDF
> footer(header) sections and I want to ignore the footer/header
> sections.
> 
> I did some research and find that class PDFTextStripperByArea is a
> promising solution. But could someone tell me: how to set the
> Rectangle2D object in method "addRegion"?
> 
> To be more specific, here is some example code:
> 
> PDFTextStripperByArea stripper = new PDFTextStripperByArea();
> Rectangle rect = new Rectangle( x, y, width, height );
> stripper.addRegion( "class1", rect );
> 
> What does x, y, width and height mean? And how to set their values?
> 
> Thanks!

Paul Monday
paul.monday@docsforce.com



