pdfbox-users mailing list archives

From George Van Treeck <tre...@yahoo.com>
Subject Re: Bug or known limitation?
Date Tue, 15 Dec 2009 18:31:16 GMT
Thanks. I'll modify my local sources to ignore the "PS" subtype.

Also, I recommend the following code changes to fix problems that I have run into with pdfbox:

Line 970 in pdfparser.BaseParser, to avoid an exception when the PDF contains an invalid
value such as "t":
<< if (trueString.equals("true"))
>> if ("true".startsWith(trueString))
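To make the effect of that change concrete, here is a minimal standalone sketch of the prefix check. The class and method names (LenientBoolean, isTruePrefix) are hypothetical and not part of PDFBox, and the empty-token guard is an assumption added here, since every string starts with the empty string.

```java
// Hypothetical helper, not PDFBox code: accepts a truncated "true" token.
final class LenientBoolean {
    static boolean isTruePrefix(String token) {
        // Guard added for this sketch: "true".startsWith("") is always true.
        if (token == null || token.isEmpty()) {
            return false;
        }
        // "t", "tr", "tru" and "true" all pass; anything else fails.
        return "true".startsWith(token);
    }
}
```

With this check a stray "t" is read as true, while "false" or "truex" is still rejected.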

Insert below line 240 in cos.COSString to guess at the appropriate character encoding. I know the
code is not "right", but it seems to avoid an exception in some PDFs that I crawl:
>> else if( data[0] >= (byte)0xC0 && data[0] <= (byte)0xFD )
>> {
>>   encoding = "UTF-8";
>>   start=2;
>> }
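One detail worth spelling out: Java bytes are signed, so the comparison above works only because both bounds are negative ((byte)0xC0 is -64 and (byte)0xFD is -3). A minimal standalone sketch of the same range test with the byte widened to an unsigned value first is shown below. The class and method names are hypothetical and not part of PDFBox.

```java
// Hypothetical helper, not PDFBox code.
final class LeadByteCheck {
    // True when the first byte falls in 0xC0..0xFD, the same range the patch tests.
    static boolean looksLikeMultiByteLead(byte[] data) {
        if (data == null || data.length == 0) {
            return false;
        }
        int first = data[0] & 0xFF; // widen to an unsigned value in 0..255
        return first >= 0xC0 && first <= 0xFD;
    }
}
```

This only inspects the first byte, so like the patch it is a guess and not a validation of the whole string.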

A lot of people have asked for this feature, and some "experts" have replied that it's not
possible. So, here is some code to implement the impossible. The code below unscrambles
text so that each line of text in one column is joined with the text that is roughly aligned
with it in an adjacent column. I know next to nothing about PDFs, so I'm sure there are many
use cases this does not cover. But 10% of a feature is better than 0%. Since you know a lot
more about PDFs than I do, you might want to inspect these changes to util.PDFTextStripper and
rewrite them in a more appropriate manner.

Insert at line 438
    /**
     * This sort handles text aligned into columns by using
     * column-based alignment to determine the text ordering.
     * Specifically, vertically adjacent items are grouped into sets,
     * where each set contains adjacent items with the same x (left horizontal)
     * coordinate. If a horizontally left-adjacent text item is part of a set
     * containing other vertically adjacent text items at the same x coordinate,
     * then the items in the first set form a separate column and are all added to
     * the list first, followed by the horizontally adjacent set.
     * 
     * @param textList the text positions to reorder in place
     */
    @SuppressWarnings("unchecked")
    protected void sortByPosition(List<TextPosition> textList) {
      /**
       * An array of sets, each set containing a sublist of text items
       * all starting at the same column border.
       */
      final HashMap<Float, ArrayList<TextPosition>> set_map =
        new HashMap<Float, ArrayList<TextPosition>>();
      
      final int TEXT_LIST_SIZE = textList.size();
      if (TEXT_LIST_SIZE <= 1)
        return; // nothing to sort
      
      // Group into sets.
      Iterator<TextPosition> textIter = textList.iterator();
      while( textIter.hasNext() )
      {
          TextPosition position = textIter.next();
          float positionX = position.getXDirAdj();
          ArrayList<TextPosition> set = set_map.get( positionX );
          if (set == null)
          {
            set = new ArrayList<TextPosition>();
            set_map.put( positionX, set );
          }
          set.add( position );
      }
      
      // Sort each set
      final int MAP_SIZE = set_map.size();
      if (MAP_SIZE > 0) {
        // First, sort the sets.
        Iterator<Float> mapIter = set_map.keySet().iterator();
        final ArrayList<Float> map_index = new ArrayList<Float>(MAP_SIZE);
        while ( mapIter.hasNext() )
          map_index.add( mapIter.next() );
        // Sort by x coordinate of column margin.
        Collections.sort(map_index);
        // Second, sort within each set.
        for (int i = 0; i < MAP_SIZE; i++)
        {
          ArrayList<TextPosition> set = set_map.get( map_index.get(i) );
          if (set.size() > 1)
          {
            TextPositionComparator comparator = new TextPositionComparator();
            Collections.sort( set, comparator );
          }
        }
        // Third, coalesce horizontally adjacent text items.
        // (Not implemented in this patch.)
        // Fourth, re-order the textList. Clear it first so the sorted
        // items are not appended after the original, unsorted entries.
        textList.clear();
        for (int i = 0; i < MAP_SIZE; i++)
        {
          ArrayList<TextPosition> set = set_map.get( map_index.get(i) );
          Iterator<TextPosition> setIter = set.iterator();
          while ( setIter.hasNext() )
            textList.add( setIter.next() );
        }
      }
    }

Lines 462, 463:
<< TextPositionComparator comparator = new TextPositionComparator();
<< Collections.sort( textList, comparator );
>> sortByPosition(textList);
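The grouping idea can be tried outside PDFBox with a minimal standalone sketch. Item is a hypothetical stand-in for TextPosition, and ColumnSort and sortByColumn are made-up names. The sketch assumes that y grows down the page and that items in one column share exactly the same x value, which real PDFs will often violate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for TextPosition; only the fields the sketch needs.
final class Item {
    final float x;
    final float y;
    final String text;

    Item(float x, float y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class ColumnSort {
    // Reorders items in place: columns left to right, top to bottom within each.
    static void sortByColumn(List<Item> items) {
        if (items.size() <= 1) {
            return; // nothing to sort
        }
        // TreeMap keeps the column keys (left x coordinates) in ascending order.
        Map<Float, List<Item>> columns = new TreeMap<>();
        for (Item item : items) {
            columns.computeIfAbsent(item.x, k -> new ArrayList<>()).add(item);
        }
        // Clear before refilling so the sorted items replace the old order.
        items.clear();
        for (List<Item> column : columns.values()) {
            column.sort(Comparator.comparingDouble((Item i) -> i.y));
            items.addAll(column);
        }
    }
}
```

Grouping on an exact float key is the weak point: a tolerance-based bucket for x would likely be needed for text that is only roughly aligned.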


Thanks,
George Van Treeck



----- Original Message ----
From: Andreas Lehmkühler <andreas@lehmi.de>
To: George Van Treeck <treeck@yahoo.com>; users@pdfbox.apache.org
Sent: Tue, December 15, 2009 4:34:45 AM
Subject: Re: Bug or known limitation?

Hi,

Sent: Tue, 15 Dec 2009 From: George Van Treeck <treeck@yahoo.com>

> I ran into the exception below when using an older 0.8 version. So, I did a
> build using HEAD from subversion. And the exception persists. The following
> is output from a little web crawler I wrote.
> 
> ERROR: Unable to load PDF document:
> http://www.polaroid.com/media/document/a932manualEN20091019.pdf
> java.io.IOException: Unknown xobject subtype 'PS'
>     at org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject.createXObject(PDXObject.java:165)
>     at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:161)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:226)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:206)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
>     at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
>     at webcrawler.WebCrawler.getContent(WebCrawler.java:1444)
> 
PDFBox doesn't support that kind of subtype for XObjects. According to the PDF reference manual
(v1.7, section 4.7.1, PostScript XObjects), it's rarely used and shouldn't have any effect when
viewing the document. It is only used when printing on a PostScript-enabled printer. This feature
is likely to be removed from PDF in a future version.

PDFBox should ignore those PS XObjects in future.
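
As a rough illustration of "ignore instead of throw", here is a minimal standalone sketch that filters subtype names before they are dispatched. XObjectFilter and supportedSubtypes are hypothetical names; this is not how PDXObject.createXObject is implemented, only the general shape of skipping an unsupported entry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PDFBox API.
final class XObjectFilter {
    // Returns the subtype names with PostScript XObjects dropped.
    static List<String> supportedSubtypes(List<String> subtypes) {
        List<String> kept = new ArrayList<>();
        for (String subtype : subtypes) {
            if ("PS".equals(subtype)) {
                continue; // PostScript XObject: skip it rather than throwing
            }
            kept.add(subtype);
        }
        return kept;
    }
}
```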

> -George
> 

BR
Andreas Lehmkühler

