pdfbox-users mailing list archives

From Walter Kehl <walter.k...@outlook.com>
Subject RE: Downloadind a pdf file doesn't work
Date Fri, 19 Dec 2014 18:20:53 GMT
Hi Tilman,

thanks for your response. When I use your code (I am now also at version
1.8.8) I get the correct result. I think the problem has to do with the
stream object which gets used.

I am using (for various reasons) the http client from the Apache
HttpComponents library. This client creates an input stream of type
EofSensorInputStream whereas the URL.openStream method from your code
returns an HttpInputStream object. Maybe the EofSensorInputStream closes the
stream too early, because the following scenario happens:

- I download the pdf file with my Apache client. The resulting file can be
opened without a problem with a PDF viewer. Also a file compare doesn't show
any difference to a manually downloaded version.
- But when I then read this file into PDFBox, PDFBox cannot read it
either and returns this warning: "End-of-File, expected line".
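
The file compare mentioned above can also be done in code. This is only a
minimal sketch using the plain JDK; the class name DownloadCompare and the
method firstDifference are made up for illustration and are not part of
PDFBox or HttpComponents. It reports the offset of the first differing byte
between two downloaded copies.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox or HttpComponents.
class DownloadCompare {

    /** Returns the offset of the first differing byte, or -1 if the files are identical. */
    public static long firstDifference(Path a, Path b) throws IOException {
        byte[] x = Files.readAllBytes(a);
        byte[] y = Files.readAllBytes(b);
        int common = Math.min(x.length, y.length);
        for (int i = 0; i < common; i++) {
            if (x[i] != y[i]) {
                return i;
            }
        }
        // Same prefix: identical if the lengths match, otherwise one file is truncated.
        return x.length == y.length ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java DownloadCompare <file1> <file2>");
            return;
        }
        long offset = firstDifference(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(offset < 0
                ? "files are identical"
                : "first difference at byte " + offset);
    }
}
```

A result other than -1 that equals the length of the shorter file would point
to a truncated download rather than corrupted content.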

But things are getting rather involved here and I am wondering whether it is
worth spending more time on this issue...
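
One possible way to take the stream type out of the picture would be to read
the whole response into memory first and hand the parser a plain
ByteArrayInputStream. The following is a sketch under that assumption, using
only the JDK; BufferedDownload and readFully are illustrative names, and the
stand-in stream in main replaces whatever stream the HTTP client returns.
Whether this avoids the warnings has not been verified here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox or HttpComponents.
class BufferedDownload {

    /** Reads the stream until end-of-file and returns everything that was read. */
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() may return fewer bytes than requested; loop until it signals EOF with -1.
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream obtained from the HTTP client (assumption).
        InputStream network =
                new ByteArrayInputStream("%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
        byte[] pdfBytes = readFully(network);
        InputStream inMemory = new ByteArrayInputStream(pdfBytes);
        System.out.println("buffered " + pdfBytes.length + " bytes");
        // With PDFBox on the classpath, inMemory would then be passed on:
        // PDDocument document = PDDocument.load(inMemory);
    }
}
```

The cost is that the whole file is held in memory, which should be acceptable
for documents of this size but may not be for very large ones.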


Best regards
Walter




-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Saturday, 13 December 2014 20:26
To: users@pdfbox.apache.org
Subject: Re: Downloadind a pdf file doesn't work

Oops, meant to say 1.8.8 is the current one.

Tilman

On 13.12.2014 at 19:42, Tilman Hausherr wrote:
> Hi,
>
> So you're using a "special" http client...
>
> Anyway, here's what I just did with the 1.8.9 version:
>
>         URL url = new URL("http://esa.un.org/unpd/wup/PressRelease/WUP2014_PressRelease.pdf");
>         InputStream is = url.openStream();
>         PDDocument doc = PDDocument.load(is);
>         System.out.println("pages: " + doc.getNumberOfPages());
>
> All output I get is
>
>     pages: 2
>
> Btw the two "errors" you mention are warnings about malformed PDFs. 
> However there's really a length 66346 in your file and I don't get 
> that warning. This means that somehow you're not getting the exact 
> file. Maybe save what you're downloading with your "http client" and 
> compare it with what you download with a browser. Or try what I did 
> and see if it works.
>
> What version are you using? 1.8.9 is the current one.
>
> Tilman
>
> On 13.12.2014 at 18:41, Walter Kehl wrote:
>> Hi John, Tilman,
>>
>> thanks for the reply. Here is some additional information:
>>
>> - the http client I am using to get the input stream already has a 
>> user agent set. Also, I have already downloaded lots of PDF files 
>> with PDFBox without ever having a problem.
>> - when I try to load the document remotely from the URL, I get the 
>> following error messages:
>>    18:34:32 WARN  BaseParser           :: Specified stream length 66346 is wrong. Fall back to reading stream until 'endstream'.
>>    18:34:35 WARN  XrefTrailerResolver  :: Did not found XRef object at specified startxref position 0
>> - I have written the input stream directly to a file and it was a 
>> valid PDF.
>> I could load it both with an external tool and with PDFBox.
>>
>> Yes, of course I could always download a file first to a temp file 
>> and then load it into PDFBox. But I think the direct way is more 
>> elegant and faster.
>> I have also debugged a little bit into the code and to me it doesn't 
>> look like PDFBox uses a temporary file, but rather reads directly 
>> from the input stream.... but I might be wrong.
>>
>> Anyway, thanks for providing such good free software!
>>
>> Best
>> Walter
>>
>> -----Original Message-----
>> From: John Hewson [mailto:john@jahewson.com]
>> Sent: Friday, 12 December 2014 18:57
>> To: users@pdfbox.apache.org
>> Subject: Re: Downloadind a pdf file doesn't work
>>
>> Good point Tilman. Walter, try writing the InputStream to a 
>> File and check that it's a valid PDF.
>>
>> -- John
>>
>>> On 12 Dec 2014, at 09:50, Tilman Hausherr <THausherr@t-online.de>
>>> wrote:
>>>
>>> This sounds more like a http problem. Try setting a user agent like a
>>> browser.
>>> https://stackoverflow.com/questions/2529682/setting-user-agent-of-a-java-urlconnection
>>>
>>> Tilman
>>>
>>> On 12.12.2014 at 11:53, Walter Kehl wrote:
>>>> Hi all,
>>>>
>>>>   I have the following situation:
>>>>
>>>>   I am loading files from the internet with PdfBox using the call
>>>>
>>>> PDDocument document = PDDocument.load( inputStream );
>>>>
>>>>   So far it has worked nicely, but I have problems with this file :
>>>> http://esa.un.org/unpd/wup/PressRelease/WUP2014_PressRelease.pdf
>>>>
>>>>   After I load it, it is empty, and the call
>>>> document.getNumberOfPages() returns 0.
>>>>
>>>> However when I download the file manually and then load it into 
>>>> PdfBox, then everything is fine.
>>>>
>>>>   Any idea what could be happening? I am currently using PdfBox 1.8.5.
>>>>
>>>>   Thanks and Best Regards
>>>>
>>>> Walter
>>>>
>>>>
>

