hc-httpclient-users mailing list archives

From: Oleg Kalnichevski <ol...@apache.org>
Subject: Re: Not able to download PDF and PNG files using Httpclient
Date: Sun, 12 Apr 2009 10:10:59 GMT

On Fri, 2009-04-10 at 15:55 -0700, Rutuja Joshi wrote:
> Hello,
> I am working on a web crawler application and using Apache HttpClient
> for it. I have the following issues that I am not able to resolve.
> (This is my first post and I am not sure how much detail I can provide
> or what questions I can ask, so please pardon me.)
> 1> Whenever I try to download a PDF file using HttpClient, the file that
> gets downloaded is approximately half the size of the one I download
> using Firefox. The same happens with PNG files. Both Acrobat and the
> image viewer reject the files, saying the format is invalid. It may be
> related to compression, but how do I find out? I read the response as an
> input stream, wrap it in a buffered stream, and write it to a file, so
> I am just fetching the raw bytes from the response. If needed,
> I will provide a detailed log. (I read about the wire log but haven't
> tried it; if needed I'll try to produce one and provide it.)

Yes, a wire log would be quite helpful, as well as a code snippet
demonstrating the way you are using the HttpClient API.
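
For comparison while preparing that snippet, here is a minimal sketch of a
byte-for-byte stream copy using only java.io. The poster's code is not shown
in the thread, so this does not diagnose it. It only illustrates two common
pitfalls to rule out: passing binary data through a Reader/Writer, and
treating a short read as the end of the stream. The class name StreamCopy and
the in-memory streams are assumptions for illustration; in real use the input
would be the response body stream and the output a FileOutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of HttpClient.
final class StreamCopy {

    /** Copies every byte from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read() returns -1 only at end of stream; a short read is not the end.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n); // write only the n bytes actually read
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body: the 8-byte PNG signature (non-text bytes).
        byte[] body = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(body), sink);
        System.out.println(copied + " bytes copied");
    }
}
```

Comparing the returned byte count with the Content-Length response header
(when the server sends one) is a quick way to see whether the body was
truncated on the way to disk.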

Logging guide for HttpClient 4.0: 

Logging guide for HttpClient 3.x:
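
For orientation, a minimal sketch of switching on the wire log through
Commons Logging's SimpleLog backend by setting system properties. The
property and logger names below are given as recalled from the respective
logging guides (org.apache.http.wire for 4.0, httpclient.wire for 3.x) and
should be checked against them. The class name WireLogSetup is an assumption
for illustration. The properties need to be set before the first HttpClient
class is loaded, and the output goes to stderr.

```java
// Hypothetical helper; it only sets JVM system properties.
final class WireLogSetup {
    private static final String PREFIX = "org.apache.commons.logging.simplelog.";

    /** Routes Commons Logging to its built-in SimpleLog backend. */
    private static void useSimpleLog() {
        System.setProperty("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.SimpleLog");
        System.setProperty(PREFIX + "showdatetime", "true");
    }

    /** HttpClient 4.0: wire logger assumed to be org.apache.http.wire. */
    static void enableForHttpClient4() {
        useSimpleLog();
        System.setProperty(PREFIX + "log.org.apache.http.wire", "debug");
    }

    /** HttpClient 3.x: wire logger assumed to be httpclient.wire. */
    static void enableForHttpClient3() {
        useSimpleLog();
        System.setProperty(PREFIX + "log.httpclient.wire", "debug");
    }

    public static void main(String[] args) {
        enableForHttpClient4();
        System.out.println(System.getProperty(PREFIX + "log.org.apache.http.wire"));
    }
}
```

The same settings can be passed on the command line as -D options instead of
being set in code.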

> 2> How do I know whether the file that I am fetching is a text file or
> not?

By the Content-Type response header.
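
As a rough illustration, the header value can be reduced to its media type
and checked against a list of textual types. This is a sketch under
assumptions: the class ContentTypes, its method names, and the particular
list of types treated as text are all hypothetical choices, and the header
value is passed in as a plain string rather than read from a response object.

```java
import java.util.Locale;

// Hypothetical helper, not part of HttpClient.
final class ContentTypes {

    /** Returns the media type without parameters, lower-cased, or "" if absent. */
    static String mediaType(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return "";
        }
        // Drop parameters such as "; charset=UTF-8".
        int semi = contentTypeHeader.indexOf(';');
        String type = semi >= 0 ? contentTypeHeader.substring(0, semi) : contentTypeHeader;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** True for text/* and a few common text-based application types (an arbitrary list). */
    static boolean isTextual(String contentTypeHeader) {
        String type = mediaType(contentTypeHeader);
        return type.startsWith("text/")
                || type.equals("application/xhtml+xml")
                || type.equals("application/xml")
                || type.equals("application/json");
    }

    public static void main(String[] args) {
        System.out.println(isTextual("text/html; charset=UTF-8"));
        System.out.println(isTextual("application/pdf"));
    }
}
```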

> For example, given that I do not know the type of the file I am
> fetching, is there any way to know from the Content-Type etc. what type
> of file I have fetched?
> I tried the Content-Type header; it's the same for a normal HTML file,
> a PDF file, and an image file.

In this case the server-side code is broken.
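
When the header cannot be trusted, one possible client-side fallback (not
something suggested in this thread) is to look at the first bytes of the
body. PDF files start with "%PDF-" and PNG files start with the fixed 8-byte
signature 89 50 4E 47 0D 0A 1A 0A. The sketch below checks only these two
formats. The class MagicBytes and its "unknown" return value are assumptions
for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HttpClient.
final class MagicBytes {
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG =
            {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    /** Guesses a media type from the leading bytes of a body, or "unknown". */
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, PNG)) {
            return "image/png";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(head));
    }
}
```

The same check is a quick sanity test for a downloaded file: if a supposed
PDF does not begin with "%PDF-", the bytes were altered or replaced before
they reached the disk.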

> 3> Redirects - I have set followredirects = true. I have one URL that
> redirects when accessed from Firefox, but with HttpClient it does not.
> The status code for some reason is 200 (OK). Does it have to be 3XX
> for HttpClient to follow redirects? The HTML dump from HttpClient is
> as follows:
> <html>
> <head>
>     <script>
>         <!--
>         redirect_url="http://www.feedroom.com/";
>         window.location.replace(redirect_url);
>         -->
>     </script>
> </head>
> <body>
>     Redirecting...<br/>
>     This url is deprecated. If your browser doesn't immediately redirect 
> you to the new url, please click the link below:<br/>
>     <a href="http://www.feedroom.com/">http://www.feedroom.com/</a>
> </body>
> </html>

HttpClient takes care of redirects automatically. If you do not want
that, you can always disable automatic redirect handling.
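
Note that the page quoted above is served with status 200 and redirects
through JavaScript, which an HTTP client library does not execute. Automatic
redirect handling applies to HTTP-level redirects, that is, 3xx responses
carrying a Location header. A crawler that wants to follow such a page has
to extract the target from the body itself. The sketch below is a heuristic
tied to the shape of this particular page (a redirect_url="..." assignment).
The class ScriptRedirect and its regular expression are assumptions for
illustration, not a general JavaScript redirect detector.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
final class ScriptRedirect {
    // Matches redirect_url="..." or redirect_url='...' and captures the URL.
    private static final Pattern ASSIGNMENT =
            Pattern.compile("redirect_url\\s*=\\s*([\"'])(.*?)\\1");

    /** Returns the URL assigned to redirect_url in the page, if any. */
    static Optional<String> target(String html) {
        Matcher m = ASSIGNMENT.matcher(html);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<script>\n"
                + "redirect_url=\"http://www.feedroom.com/\";\n"
                + "window.location.replace(redirect_url);\n"
                + "</script>";
        System.out.println(target(html).orElse("(no script redirect found)"));
    }
}
```

The extracted URL can then be fetched with a new request like any other
link the crawler discovers.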

Hope this helps


> Thanks in advance!
> Rutuja
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
> For additional commands, e-mail: httpclient-users-help@hc.apache.org
