pdfbox-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PDFBOX-267) CMap parse fails during text extract
Date Mon, 04 Aug 2008 17:34:44 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619598#action_12619598 ]

Jukka Zitting commented on PDFBOX-267:
--------------------------------------

[Comment on SourceForge]
Date: 2008-04-24 14:42
Sender: bmk06
Logged In: YES 
user_id=1683216
Originator: NO

Hi, I've recently come across exactly the same error when attempting to
extract text from a certain PDF. Has there been any progress fixing it? I'm
using pdfbox 0.7.3 and fontbox 0.1.0.

Hope you can help, thanks.

Ben Kirby
kirby.bm@gmail.com

[Comment on SourceForge]
Date: 2008-04-29 18:16
Sender: matthillsdon
Logged In: YES 
user_id=701665
Originator: YES

File Added: RWPRNU+Univers-Light.cmap
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=276310&aid=1702313

[Comment on SourceForge]
Date: 2008-04-29 18:17
Sender: matthillsdon
Logged In: YES 
user_id=701665
Originator: YES

File Added: SWPRNU+Myriad-Bold-Identity-H.cmap
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=276312&aid=1702313
							
[Comment on SourceForge]
Date: 2008-04-29 18:30
Sender: matthillsdon
Logged In: YES 
user_id=701665
Originator: YES

I've attached the two CMap streams that prevent text-extract for my PDF.
ExtractFonts didn't find them as they are resources of PDXObjectForm
objects rather than pages.

Perhaps the PDF creation software is at fault?  Ben, can you point me to
the relevant specification?  It would be good to cope anyway though if
there is a reasonable approach.

There are two issues:
  1) CR in seemingly incorrect places, e.g. <0000\r>
  2) beginbfchar<0000\r> - missing whitespace caused a misparse.

A not-so-nice patch to work around / illustrate these issues is attached.
File Added: bug1702313-1.patch
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=276314&aid=1702313
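
For illustration, here is a minimal sketch of a tokenizer that tolerates both quirks listed above. It is not the attached patch and not FontBox's CMapParser; the class name LenientCMapTokens and the method tokenize are hypothetical. It assumes the usual PDF/PostScript syntax conventions, under which whitespace inside a hex string is ignored and '<' is a delimiter that ends the preceding token even without whitespace. If that reading is right, the CMaps in question may well be valid and the parser is simply too strict.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of FontBox or PDFBox.
public class LenientCMapTokens {

    /** Splits CMap text into tokens, tolerating the two quirks described above. */
    public static List<String> tokenize(String cmap) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < cmap.length()) {
            char c = cmap.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // skip separators, including stray CRs
            } else if (cmap.startsWith("<<", i) || cmap.startsWith(">>", i)) {
                tokens.add(cmap.substring(i, i + 2)); // dictionary delimiters
                i += 2;
            } else if (c == '<') {
                // Hex string: drop any whitespace found before the closing '>'.
                StringBuilder hex = new StringBuilder("<");
                i++;
                while (i < cmap.length() && cmap.charAt(i) != '>') {
                    char h = cmap.charAt(i);
                    if (!Character.isWhitespace(h)) {
                        hex.append(h);
                    }
                    i++;
                }
                if (i == cmap.length()) {
                    throw new IllegalArgumentException("unterminated hex string");
                }
                i++; // consume '>'
                tokens.add(hex.append('>').toString());
            } else {
                // Keyword, name or number: ends at whitespace or at a '<',
                // so "beginbfchar<0000>" yields two tokens.
                int start = i;
                while (i < cmap.length()
                        && !Character.isWhitespace(cmap.charAt(i))
                        && cmap.charAt(i) != '<') {
                    i++;
                }
                tokens.add(cmap.substring(start, i));
            }
        }
        return tokens;
    }
}
```

With this sketch, the input "1 beginbfchar<0000\r> <0041>\nendbfchar" is split into the tokens 1, beginbfchar, <0000>, <0041> and endbfchar. A real fix would still have to map these tokens onto the parser's existing operator handling.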

[Comment on SourceForge]
Date: 2008-05-29 11:11
Sender: nobody
Logged In: NO 

Hi both, any progress on this?

Thanks,
Ben


> CMap parse fails during text extract
> ------------------------------------
>
>                 Key: PDFBOX-267
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-267
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1702313
> Originally submitted by matthillsdon on 2007-04-17 09:21.
> Unfortunately I cannot supply the PDF file.  Any suggestion appreciated.
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:220)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:79)
>         at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
>         at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
>         at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>         at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> ...
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1702313&file_id=226802
> ExtractFonts.java (text/java), 1721 bytes
> A simple program to extract fonts and CMap streams
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Sorry for the delay.  Updated extract output at
> http://www.hillsdon.net/CMapDocument3.pdf
> Stack trace for text extract as before:
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:117)
> ...
> Thanks, Matt.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Hi Matt,
> any update?
> Ben
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> ok, I looked at it some more and I'd like to have you get the latest nightly build and
> try to run ExtractText on your original PDF again.  If it doesn't work then run the ExtractFonts
> again (using the nightly build) and post the results.
> The issue is that there is some extra data at the end of the Cmap stream and tonight
> I happened to fix an issue with parsing and having extra data at the end of the stream for
> a different user.  So I don't know if this is the same issue but I'd rather have you try the
> nightly build than have me chasing a ghost.
> Ben
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Output with the decryption here
> http://www.hillsdon.net/CMapDocument2.pdf
> Thanks.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> shoot, I think your document was encrypted.  It needs to be decrypted for the extraction
> to work, I should have had that as part of the program.  Can you take the attached program
> and add the lines after the PDDocument.load call
> if( doc.isEncrypted() )
> {
>     doc.decrypt( "" );
> }
> and resend the CMapDocument.pdf
> Thanks,
> Ben
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Result too large to attach.  Please see
> http://www.hillsdon.net/CMapDocument.pdf
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Attached is a simple java program that will create a new pseudo PDF document that contains
> just the Font information.  Please run it on the problem PDF and upload the resulting CmapDocument.pdf
> It is a simple command line program, first compile then run it like this
> java ExtractFonts my.pdf
> Let me know if you have any questions getting it running.
> Ben
> File Added: ExtractFonts.java
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> No change unfortunately - with FontBox-0.2.0-dev-20070424 the stack trace is identical.
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
> ...
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> I just updated the CMapParser with a bug fix from
> https://sourceforge.net/forum/message.php?msg_id=4269559
> please get tonight's FontBox build and give it a try
> http://www.fontbox.org/fontbox
> [comment on SourceForge]
> Originally sent by matthillsdon.
> Logged In: YES 
> user_id=701665
> Originator: YES
> Hi Ben, thanks for the quick response.
> Using the nightly build [1] the stack trace is the same except for line numbers:
> Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
>         at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:117)
>         at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
>         at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
>         at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
>         at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
>         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
> ...
> Extracting the fonts sounds ideal.
> [1] http://www.pdfbox.org/dist/PDFBox-0.7.4-dev-20070418.zip
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Originator: NO
> Hi Matt,
> Can you try one thing for me first: upgrade to the latest nightly build of PDFBox
> ( http://www.pdfbox.org/dist/ ) and see if this is still an issue.  There have been some
> changes to the CMapParser.
> If it is still an issue I think we can write a simple program to extract just the fonts
> from your PDF and that should be enough for me to fix the bug.
> Ben

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

