pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Downloadind a pdf file doesn't work
Date Sat, 13 Dec 2014 18:42:43 GMT
Hi,

So you're using a "special" http client...

Anyway, here's what I just did with the 1.8.9 version:

         URL url = new URL("http://esa.un.org/unpd/wup/PressRelease/WUP2014_PressRelease.pdf");
         InputStream is = url.openStream();
         PDDocument doc = PDDocument.load(is);
         System.out.println("pages: " + doc.getNumberOfPages());

All output I get is

     pages: 2

Btw the two "errors" you mention are warnings about malformed PDFs.
However, the stream in your file really does have length 66346, and I
don't get that warning. This means that somehow you're not getting the
exact file. Maybe save what you're downloading with your "http client"
and compare it with what you download with a browser. Or try what I did
and see if it works.
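
A rough sketch of that comparison, using only the JDK (no PDFBox calls). The class name DownloadCheck and its helper methods are made up for illustration. It assumes you have already written the bytes from your HTTP client to one file and saved the browser download to another. The header/trailer check is only a quick sanity test, not a real PDF validation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper class, not part of PDFBox.
public class DownloadCheck {

    /** Reads the stream to its end and returns every byte it delivered. */
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns the SHA-256 digest of the data as lowercase hex. */
    public static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Quick sanity test: starts with a PDF header and contains an EOF marker. */
    public static boolean looksLikePdf(byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return text.startsWith("%PDF-") && text.lastIndexOf("%%EOF") >= 0;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: DownloadCheck <client-copy> <browser-copy>");
            return;
        }
        byte[] fromClient = Files.readAllBytes(Paths.get(args[0]));
        byte[] fromBrowser = Files.readAllBytes(Paths.get(args[1]));
        System.out.println("client:  " + fromClient.length + " bytes, sha256 "
                + sha256Hex(fromClient) + ", looks like PDF: " + looksLikePdf(fromClient));
        System.out.println("browser: " + fromBrowser.length + " bytes, sha256 "
                + sha256Hex(fromBrowser) + ", looks like PDF: " + looksLikePdf(fromBrowser));
        System.out.println("identical: " + Arrays.equals(fromClient, fromBrowser));
    }
}
```

To capture exactly what your client delivers, something like Files.write(Paths.get("client.pdf"), DownloadCheck.readFully(inputStream)) should do. If the two copies differ in size or hash, the problem is on the HTTP side rather than in PDFBox.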

What version are you using? 1.8.9 is the current one.

Tilman

On 13.12.2014 at 18:41, Walter Kehl wrote:
> Hi John, Tilman,
>
> thanks for the reply. Here is some additional information:
>
> - the http client I am using to get the input stream already has a user
> agent set. Also, I have already downloaded lots of PDF files with PDFBox
> without ever having a problem.
> - when I try to load the document remotely from the URL, I get the following
> error messages:
>    18:34:32 WARN  BaseParser           :: Specified stream length 66346 is
> wrong. Fall back to reading stream until 'endstream'.
>    18:34:35 WARN  XrefTrailerResolver  :: Did not found XRef object at
> specified startxref position 0
> - I have written the input stream directly to a file and it was a valid PDF.
> I could load it both with an external tool and with PDFBox.
>
> Yes, of course I could always download a file first to a temp file and then
> load it into PDFBox. But I think the direct way is more elegant and faster.
> I have also debugged a little bit into the code and to me it doesn't look
> like PDFBox uses a temporary file, but rather reads directly from the input
> stream.... but I might be wrong.
>
> Anyway, thanks for providing such good free software!
>
> Best
> Walter
>
> -----Original Message-----
> From: John Hewson [mailto:john@jahewson.com]
> Sent: Friday, 12 December 2014 18:57
> To: users@pdfbox.apache.org
> Subject: Re: Downloadind a pdf file doesn't work
>
> Good point, Tilman. Walter, try writing the InputStream to a File and
> check that it's a valid PDF.
>
> -- John
>
>> On 12 Dec 2014, at 09:50, Tilman Hausherr <THausherr@t-online.de> wrote:
>>
>> This sounds more like an HTTP problem. Try setting a user agent like a
>> browser.
>> https://stackoverflow.com/questions/2529682/setting-user-agent-of-a-java-urlconnection
>>
>> Tilman
>>
>> On 12.12.2014 at 11:53, Walter Kehl wrote:
>>> Hi all,
>>>
>>>   
>>> I have the following situation:
>>>
>>>   
>>> I am loading files from the internet with PdfBox using the call
>>>
>>> PDDocument document = PDDocument.load( inputStream );
>>>
>>>   
>>> So far it has worked nicely, but I have problems with this file:
>>> http://esa.un.org/unpd/wup/PressRelease/WUP2014_PressRelease.pdf
>>>
>>>   
>>> After I load it, it is empty, and the call
>>> document.getNumberOfPages() returns 0.
>>>
>>> However when I download the file manually and then load it into
>>> PdfBox, then everything is fine.
>>>
>>>   
>>> Any idea what could be happening? I am currently using PdfBox 1.8.5.
>>>
>>>   
>>> Thanks and Best Regards
>>>
>>> Walter
>>>
>>>   
>>>   
>>>   
>>>

