pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Downloading a pdf file doesn't work
Date Sat, 13 Dec 2014 19:25:37 GMT
Oops, meant to say 1.8.8 is the current one.

Tilman

On 13.12.2014 at 19:42, Tilman Hausherr wrote:
> Hi,
>
> So you're using a "special" http client...
>
> Anyway, here's what I just did with the 1.8.9 version:
>
>         URL url = new URL("http://esa.un.org/unpd/wup/PressRelease/WUP2014_PressRelease.pdf");
>         InputStream is = url.openStream();
>         PDDocument doc = PDDocument.load(is);
>         System.out.println("pages: " + doc.getNumberOfPages());
>
> All output I get is
>
>     pages: 2
>
> Btw the two "errors" you mention are warnings about malformed PDFs. 
> However there's really a length 66346 in your file and I don't get 
> that warning. This means that somehow you're not getting the exact 
> file. Maybe save what you're downloading with your "http client" and 
> compare it with what you download with a browser. Or try what I did 
> and see if it works.
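
The save-and-compare check suggested above could be sketched in Java roughly as follows. This is only an illustration, not code from the thread: the class name `DownloadCompare`, the method `firstMismatch`, and the two file paths given on the command line are assumptions. It uses only the standard library and reports both file sizes plus the first byte offset at which the copies differ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class DownloadCompare {

    // Returns the offset of the first differing byte, or -1 if both arrays
    // are identical. If one array is a strict prefix of the other, the
    // length of the shorter array is returned.
    static int firstMismatch(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: DownloadCompare <client-file> <browser-file>");
            return;
        }
        // args[0]: file written from the HTTP client's stream (hypothetical path)
        // args[1]: the same URL saved with a browser (hypothetical path)
        byte[] fromClient = Files.readAllBytes(Paths.get(args[0]));
        byte[] fromBrowser = Files.readAllBytes(Paths.get(args[1]));
        System.out.println("client copy:  " + fromClient.length + " bytes");
        System.out.println("browser copy: " + fromBrowser.length + " bytes");
        int offset = firstMismatch(fromClient, fromBrowser);
        System.out.println(offset < 0
                ? "files are identical"
                : "first difference at byte offset " + offset);
    }
}
```

A size difference or an early mismatch would point at the transfer rather than at the parser.
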
>
> What version are you using? 1.8.9 is the current one.
>
> Tilman
>
> On 13.12.2014 at 18:41, Walter Kehl wrote:
>> Hi John, Tilman,
>>
>> thanks for the reply. Here is some additional information:
>>
>> - The HTTP client I am using to get the input stream already has a user
>> agent set. Also, I have already downloaded lots of PDF files with PDFBox
>> and there was never a problem.
>> - When I try to load the document remotely from the URL, I get the
>> following error messages:
>>    18:34:32 WARN  BaseParser           :: Specified stream length 66346 is wrong. Fall back to reading stream until 'endstream'.
>>    18:34:35 WARN  XrefTrailerResolver  :: Did not found XRef object at specified startxref position 0
>> - I have written the input stream directly to a file and it was a valid PDF.
>> I could load it both with an external tool and with PDFBox.
>>
>> Yes, of course I could always download a file to a temp file first and then
>> load it into PDFBox. But I think the direct way is more elegant and faster.
>> I have also debugged into the code a little, and to me it doesn't look
>> like PDFBox uses a temporary file, but rather reads directly from the input
>> stream... but I might be wrong.
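
One way to keep the "direct" approach without a temp file is to read the whole stream into memory first and inspect the bytes before parsing. The following is a minimal sketch under stated assumptions: the class `BufferedPdfBytes` and its methods are hypothetical names, and the header/trailer check is only a rough heuristic (a `%PDF-` marker near the start and a `%%EOF` marker near the end), not a validity test. Whether buffering changes the outcome for this particular file is not established in the thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class BufferedPdfBytes {

    // Drains the stream completely into a byte array.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Heuristic only: looks for "%PDF-" within the first 1024 bytes and
    // "%%EOF" within the last 1024 bytes. A truncated or altered download
    // will often fail one of the two checks.
    static boolean looksLikeCompletePdf(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so string
        // offsets equal byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int header = text.indexOf("%PDF-");
        int eof = text.lastIndexOf("%%EOF");
        return header >= 0
                && header < 1024
                && eof > header
                && eof >= text.length() - 1024;
    }
}
```

If the check passes, the bytes can be wrapped in a `ByteArrayInputStream` and handed to the same `PDDocument.load(InputStream)` call used elsewhere in this thread.
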
>>
>> Anyway, thanks for providing such good free software!
>>
>> Best
>> Walter
>>
>> -----Original Message-----
>> From: John Hewson [mailto:john@jahewson.com]
>> Sent: Friday, 12 December 2014 18:57
>> To: users@pdfbox.apache.org
>> Subject: Re: Downloading a pdf file doesn't work
>>
>> Good point Tilman. Walter, try writing the InputStream to a File and
>> check that it's a valid PDF.
>>
>> -- John
>>
>>> On 12 Dec 2014, at 09:50, Tilman Hausherr <THausherr@t-online.de> wrote:
>>>
>>> This sounds more like an HTTP problem. Try setting a user agent like a
>>> browser.
>>> https://stackoverflow.com/questions/2529682/setting-user-agent-of-a-java-urlconnection
>>>
>>> Tilman
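
For readers using plain `java.net` rather than a separate HTTP client, setting a browser-like user agent might look like the sketch below. The class name `UserAgentFetch`, the helper `prepare`, and the user agent string itself are assumptions chosen for illustration; `URLConnection.setRequestProperty` is the standard-library call that sets the header, and it must be called before the connection is opened.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

final class UserAgentFetch {

    // Example value only; any browser-like string could be used here.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/1.0)";

    // Creates the connection object and sets the header. No network I/O
    // happens until getInputStream() or connect() is called.
    static URLConnection prepare(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: UserAgentFetch <url>");
            return;
        }
        URLConnection conn = prepare(new URL(args[0]));
        long total = 0;
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        System.out.println("received " + total + " bytes");
    }
}
```

The byte count printed by `main` can be compared with the size of the same file saved from a browser.
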
>>>
>>> On 12.12.2014 at 11:53, Walter Kehl wrote:
>>>> Hi all,
>>>>
>>>>   I have the following situation:
>>>>
>>>>   I am loading files from the internet with PdfBox using the call
>>>>
>>>> PDDocument document = PDDocument.load( inputStream );
>>>>
>>>>   So far it has worked nicely, but I have problems with this file:
>>>> http://esa.un.org/unpd/wup/PressRelease/WUP2014_PressRelease.pdf
>>>>
>>>>   After I load it, it is empty, and the call
>>>> document.getNumberOfPages() returns 0.
>>>>
>>>> However when I download the file manually and then load it into
>>>> PdfBox, then everything is fine.
>>>>
>>>>   Any idea what could be happening? I am currently using PdfBox 1.8.5.
>>>>
>>>>   Thanks and Best Regards
>>>>
>>>> Walter
>>>>
>>>>
>

