pdfbox-users mailing list archives

From Frank van der Hulst <drifter.fr...@gmail.com>
Subject Re: TextPosition in vb
Date Thu, 09 Jul 2015 23:07:45 GMT
Hi Tasha,
Yes, you will need to extend (subclass) the PDFTextStripper class with your
own class, which would iterate through each character in the document, get
its value and position, and assemble them into words.

There's no way to directly identify words in a PDF document... every
character is stored separately, along with its position (the TextPosition
object... its name is misleading since it's actually a character+position).
You would need to extract the characters and group them based on their
relative positions and fonts (perhaps handling super/subscripts) to
assemble the words yourself. Whilst it's (maybe) easy to do this for a
specific document or set of documents, it's not possible to have a generic
text extractor that works well in every case, so even PDFTextStripper won't
always give the output you might expect.
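
As a rough illustration of the grouping step described above, here is a
minimal Java sketch. It does not use PDFBox at all: Glyph is a hypothetical
stand-in for the values a TextPosition exposes (getCharacter(),
getXDirAdj(), getYDirAdj(), getWidthDirAdj(), getWidthOfSpace()), and the
two thresholds are arbitrary assumptions that would need tuning per
document. It assumes the glyphs arrive in reading order and ignores fonts,
super/subscripts and rotated text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordAssembler {
    // Assumed thresholds, not PDFBox defaults; tune them for your documents.
    static final float LINE_TOLERANCE = 2.0f; // max y difference within one line
    static final float GAP_FACTOR = 0.5f;     // fraction of a space width that splits words

    /** Hypothetical stand-in for the TextPosition values used here. */
    static final class Glyph {
        final String text;
        final float x, y, width, spaceWidth;

        Glyph(String text, float x, float y, float width, float spaceWidth) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.spaceWidth = spaceWidth;
        }
    }

    /** Groups glyphs (assumed to be in reading order) into words. */
    static List<String> assembleWords(List<Glyph> glyphs) {
        List<String> words = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                boolean newLine = Math.abs(g.y - prev.y) > LINE_TOLERANCE;
                float gap = g.x - (prev.x + prev.width);
                boolean wideGap = gap > prev.spaceWidth * GAP_FACTOR;
                if (newLine || wideGap) {
                    flush(current, words);
                }
            }
            if (g.text.trim().isEmpty()) {
                flush(current, words); // an explicit space glyph also ends a word
            } else {
                current.append(g.text);
            }
            prev = g;
        }
        flush(current, words);
        return words;
    }

    /** Moves the pending word, if any, into the result list. */
    private static void flush(StringBuilder current, List<String> words) {
        if (current.length() > 0) {
            words.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("H", 10f, 100f, 8f, 3f),
                new Glyph("i", 18f, 100f, 4f, 3f),
                new Glyph("P", 30f, 100f, 7f, 3f),
                new Glyph("D", 37f, 100f, 8f, 3f),
                new Glyph("F", 45f, 100f, 7f, 3f),
                new Glyph("o", 10f, 120f, 6f, 3f),
                new Glyph("k", 16f, 120f, 6f, 3f));
        System.out.println(assembleWords(glyphs)); // prints [Hi, PDF, ok]
    }
}
```

In a real PDFTextStripper subclass you would collect the values inside
processTextPosition() instead of printing them, then run something like
assembleWords() once per page.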

I don't know about vb.net... I've only worked with PDFBox in Java.

Frank


On Fri, Jul 10, 2015 at 8:52 AM, Tasha Keppler <tasha@epolk.com> wrote:

> Thank you for this.
>
> If I understand correctly (please have patience with me) this is to create
> a separate class that overrides the PDFTextStripper class. Is there a
> simpler way of doing this using the TextPosition object instead? Since
> I'm working in .net I'm not sure how easy this will be to translate.
>
> Thank you again,
>
> Tasha Keppler
>
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Thursday, July 09, 2015 4:48 PM
> To: users@pdfbox.apache.org
> Subject: Re: TextPosition in vb
>
> Sorry, I see that I posted the unreleased 2.0 version. Here's the 1.8
> version:
>
> /*
>   * Licensed to the Apache Software Foundation (ASF) under one or more
>   * contributor license agreements.  See the NOTICE file distributed with
>   * this work for additional information regarding copyright ownership.
>   * The ASF licenses this file to You under the Apache License, Version 2.0
>   * (the "License"); you may not use this file except in compliance with
>   * the License.  You may obtain a copy of the License at
>   *
>   *      http://www.apache.org/licenses/LICENSE-2.0
>   *
>   * Unless required by applicable law or agreed to in writing, software
>   * distributed under the License is distributed on an "AS IS" BASIS,
>   * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>   * See the License for the specific language governing permissions and
>   * limitations under the License.
>   */
> package org.apache.pdfbox.examples.util;
>
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.common.PDStream;
> import org.apache.pdfbox.util.PDFTextStripper;
> import org.apache.pdfbox.util.TextPosition;
>
> import java.io.IOException;
>
> import java.util.List;
>
> /**
>   * This is an example on how to get some x/y coordinates of text.
>   *
>   * Usage: java org.apache.pdfbox.examples.util.PrintTextLocations &lt;input-pdf&gt;
>   *
>   * @author <a href="mailto:ben@benlitchfield.com">Ben Litchfield</a>
>   * @version $Revision: 1.7 $
>   */
> public class PrintTextLocations extends PDFTextStripper {
>      /**
>       * Default constructor.
>       *
>       * @throws IOException If there is an error loading text stripper properties.
>       */
>      public PrintTextLocations() throws IOException
>      {
>          super.setSortByPosition( true );
>      }
>
>      /**
>       * This will print the documents data.
>       *
>       * @param args The command line arguments.
>       *
>       * @throws Exception If there is an error parsing the document.
>       */
>      public static void main( String[] args ) throws Exception
>      {
>          if( args.length != 1 )
>          {
>              usage();
>          }
>          else
>          {
>              PDDocument document = null;
>              try
>              {
>                  document = PDDocument.load( args[0] );
>                  if( document.isEncrypted() )
>                  {
>                      document.decrypt( "" );
>                  }
>                  PrintTextLocations printer = new PrintTextLocations();
>                  List allPages = document.getDocumentCatalog().getAllPages();
>                  for( int i=0; i<allPages.size(); i++ )
>                  {
>                      PDPage page = (PDPage)allPages.get( i );
>                      System.out.println( "Processing page: " + i );
>                      PDStream contents = page.getContents();
>                      if( contents != null )
>                      {
>                          printer.processStream( page, page.findResources(),
>                                  page.getContents().getStream() );
>                      }
>                  }
>              }
>              finally
>              {
>                  if( document != null )
>                  {
>                      document.close();
>                  }
>              }
>          }
>      }
>
>      /**
>       * A method provided as an event interface to allow a subclass to perform
>       * some specific functionality when text needs to be processed.
>       *
>       * @param text The text to be processed
>       */
>      protected void processTextPosition( TextPosition text )
>      {
>          // String literals rejoined; the mail client had wrapped them mid-string.
>          System.out.println( "String[" + text.getXDirAdj() + "," +
>                  text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
>                  text.getXScale() + " height=" + text.getHeightDir() + " space=" +
>                  text.getWidthOfSpace() + " width=" +
>                  text.getWidthDirAdj() + "]" + text.getCharacter() );
>      }
>
>      /**
>       * This will print the usage for this document.
>       */
>      private static void usage()
>      {
>          System.err.println( "Usage: java " +
>                  "org.apache.pdfbox.examples.util.PrintTextLocations <input-pdf>" );
>      }
>
> }
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>
