pdfbox-users mailing list archives

From: Thomas Fischer <fischer...@aon.at>
Subject: Re: pdfbox not working for my pdf file.
Date: Sun, 13 Dec 2009 17:45:38 GMT

Hi,

it seems that text extraction with PDFBox works badly for some PDF files created using TeX.
I have a version of
http://www.ams.org/era/2003-09-03/S1079-6762-03-00108-2/S1079-6762-03-00108-2.pdf
which produces similar results to your file:

BXC4BXBVCCCAC7C6C1BV CABXCBBXBTCABVC0 BTC6C6C7CDC6BVBXC5BXC6CCCB...

My impression is that this is due to errors somewhere in the chain
TeX -> DVI -> ps -> pdf,
probably caused by earlier versions of dvips (< 5.97?).
I can create a readable PDF file from the respective DVI file using other tools such as dvipdfmx.
So I am not sure whether this is actually a PDFBox bug or a problem with invalid PDF files,
although JHOVE claims that the file is well-formed and valid. I suppose that a character or
glyph table is not recognised or found, and JHOVE doesn't check "the glyph descriptions of
embedded fonts".

Actually, the version of the document mentioned above makes both PDFBox 0.7.3 and 0.8.0 fail
(Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider),
while pdftotext seems to do a reasonable job.
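
A side note on the exception quoted above: a NoClassDefFoundError generally means that a class could not be loaded at run time, so it may point to a missing Bouncy Castle jar on the classpath rather than to the PDF content itself. The following is a minimal sketch of a check one could run before calling PDFBox. The class name ClasspathCheck and the method isClassAvailable are made up for illustration; only the Bouncy Castle class name is taken from the error message.

```java
// Sketch only: ClasspathCheck and isClassAvailable are hypothetical names,
// not part of PDFBox or Bouncy Castle.
class ClasspathCheck {
    // Class name taken from the NoClassDefFoundError quoted in the message.
    static final String BC_PROVIDER =
            "org.bouncycastle.jce.provider.BouncyCastleProvider";

    // Returns true if the named class can be located by this class's loader.
    // The class is looked up without running its static initialisers.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassAvailable(BC_PROVIDER)) {
            System.out.println("Bouncy Castle provider found on the classpath.");
        } else {
            System.out.println("Bouncy Castle provider missing; add its jar to the classpath.");
        }
    }
}
```

If the check reports the provider as missing, adding the Bouncy Castle jar to the classpath and rerunning the extraction would show whether the failure is related to the document at all.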

On my own file mentioned above, pdftotext produces

Ä ÌÊÇÆÁ Ê Ë Ê À ÆÆÇÍÆ Å ÆÌË Ç ÌÀ Å ÊÁ Æ Å ÌÀ Å ÌÁ Ä ËÇ
Á Ì ÎÓÐÙÑ

which is no more helpful than the result from PDFBox...
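
Output of this kind can at least be flagged automatically after extraction. The following is a rough sketch of such a check, assuming that garbled output looks like the two samples in this thread: tokens such as "a34 a85" or long runs of uppercase letter/digit pairs such as "BXC4BXBV...". The class name, both patterns and the 0.5 threshold are assumptions made for illustration, not anything PDFBox provides.

```java
import java.util.regex.Pattern;

// Sketch only: the patterns and the threshold are guesses based on the
// sample outputs quoted in this thread, not documented PDFBox behaviour.
class GlyphCodeHeuristic {
    // Tokens such as "a34" or "a85": the letter 'a' followed by 1-3 digits.
    private static final Pattern A_NUMBER = Pattern.compile("a\\d{1,3}");

    // Tokens such as "BXC4BXBV...": at least four pairs made of an uppercase
    // letter followed by an uppercase letter or digit, with at least one digit.
    private static final Pattern UPPER_PAIRS =
            Pattern.compile("(?=.*\\d)(?:[A-Z][0-9A-Z]){4,}");

    // Fraction of whitespace-separated tokens that look like glyph codes.
    static double glyphCodeFraction(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int hits = 0;
        for (String token : tokens) {
            if (A_NUMBER.matcher(token).matches()
                    || UPPER_PAIRS.matcher(token).matches()) {
                hits++;
            }
        }
        return (double) hits / tokens.length;
    }

    // Assumed threshold: at least half of the tokens look like glyph codes.
    static boolean looksLikeGlyphCodes(String text) {
        return glyphCodeFraction(text) >= 0.5;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeGlyphCodes("a34 a85 a94 a92"));
        System.out.println(looksLikeGlyphCodes("If you can copy text, PDFBox should be able to read it."));
    }
}
```

This would not catch the accented-letter output from pdftotext shown above, which would need a different test, for example a comparison against the character distribution of the expected language.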

I suppose that anybody willing to try may play around with the files available at
http://www.emis.de/journals/ERA-AMS/2003-01-003/2003-01-003.html

All the best
Thomas

On 12.12.2009 at 18:58, Ernesto De Santis wrote:

> Hi,
> 
> I have a PDF file that PDFBox can't read properly. PDFBox reads it without errors,
> but the output is only in a strange format, like codes: always the letter 'a'
> followed by two digits: a34 a85 a94 a92......
> 
> I reported it as a bug some time ago, but there has been no news about it.
> 
> My file was generated with LaTeX; I use the Kile editor on Ubuntu.
> This is the bug report:
> 
> https://issues.apache.org/jira/browse/PDFBOX-534
> 
> 
> Regards,
> Ernesto.
> 
> 
> 
> 
> users-digest-help@pdfbox.apache.org wrote:
>> 
>> ------------------------------------------------------------------------
>> 
>> Subject:
>> Re: pdfbox not working for my pdf file.
>> From:
>> Thomas Fischer <fischer.th@aon.at>
>> Date:
>> Thu, 10 Dec 2009 07:57:19 +0100
>> To:
>> users@pdfbox.apache.org
>> 
>> 
>> Hi,
>> 
>> I've been testing PDFBox on a number of different (mathematical) PDF files, and my
>> experience shows that PDFBox works in principle on all PDF files that are not image-based,
>> with some specific errors depending on the type and creation of the file. If you create
>> a PDF file by binding images together, then this file can't be read by PDFBox.
>> An easy test is to try to copy text from your file using any PDF reader, e.g. Adobe's.
>> If you can copy text, PDFBox should be able to read it.
>> 
>> Cheers
>> Thomas
>> 
>> On 09.12.2009 at 13:38, <shalini.popuri@wipro.com> wrote:
>> 
>> 
>>> Hi,
>>> Can PDFBox work for all types of PDF files?
>>> My PDF file is 6 MB in size, but it does not work properly with
>>> PDFTextStripper, i.e. stripperobject.getText(doc).
>>> Please let me know whether PDFBox can be used for all types of PDF files.
>>> If not, how can one decide whether a particular file is compatible with
>>> PDFBox?
>>> Thanks in advance!
>>> 
>>> 
>> 
>> 
>> 

