hc-httpclient-users mailing list archives

From Oleg Kalnichevski <ol...@apache.org>
Subject Re: Not able to download PDF and PNG files using Httpclient
Date Sun, 12 Apr 2009 10:10:59 GMT
On Fri, 2009-04-10 at 15:55 -0700, Rutuja Joshi wrote:
> Hello,
> 
> I am working on a web crawler application and am using Apache HttpClient 
> for it. I have the following issues that I have not been able to resolve. 
> (This is my first post and I am not sure how much detail I can provide 
> or what questions I can ask, so please pardon me.)
> 
> 1> Whenever I try to download a PDF file using HttpClient, the file that 
> gets downloaded is approximately half the size of the one I download 
> using Firefox. The same happens with a PNG file. Both Acrobat and the 
> image viewer reject the files, saying the format is invalid. It may be 
> related to compression or similar, but how do I find out? I read the 
> response as an input stream, wrap it in a buffered stream and write it 
> to a file, so basically I am just fetching the raw bytes from the 
> response. If needed, I will provide a detailed log (I read about the 
> wire log; I haven't tried it yet, but if needed I'll try to produce one 
> and send it to you).
> 

Yes, a wire log would be quite helpful, as would a code snippet
demonstrating how you are using the HttpClient API.
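
For comparison with your own code, here is a minimal sketch of copying a
response body to a file as raw bytes. It uses only java.io; the class and
method names (RawCopy, copy) are made up for illustration. The input
stream is assumed to come from HttpEntity.getContent() in HttpClient 4.0
or from getResponseBodyAsStream() in HttpClient 3.x. The sketch only shows
a byte-for-byte copy; it does not claim to identify the cause of the
truncated files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class RawCopy {
    /** Copies every byte from in to out without character decoding; returns the byte count. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read() may return fewer bytes than requested, so loop until end of stream (-1)
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

Comparing the returned count with the Content-Length response header
(when the server sends one) is a quick way to see whether the body was
read completely.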

Logging guide for HttpClient 4.0: 
http://hc.apache.org/httpcomponents-client/logging.html

Logging guide for HttpClient 3.x:
http://hc.apache.org/httpclient-3.x/logging.html
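
As a rough sketch of what those guides describe, wire logging can be
switched on through Commons Logging system properties, set before the
first HttpClient class is loaded. The property names below are given as I
recall them from the guides, so please verify them against the pages
linked above; the class and method names (WireLogSetup, enableFor3x,
enableFor4x) are hypothetical.

```java
final class WireLogSetup {
    private static final String SIMPLELOG = "org.apache.commons.logging.simplelog.";

    /** Selects the SimpleLog backend of Commons Logging and adds timestamps. */
    private static void useSimpleLog() {
        System.setProperty("org.apache.commons.logging.Log",
                "org.apache.commons.logging.impl.SimpleLog");
        System.setProperty(SIMPLELOG + "showdatetime", "true");
    }

    /** HttpClient 3.x: the wire log category is "httpclient.wire". */
    static void enableFor3x() {
        useSimpleLog();
        System.setProperty(SIMPLELOG + "log.httpclient.wire", "debug");
        System.setProperty(SIMPLELOG + "log.org.apache.commons.httpclient", "debug");
    }

    /** HttpClient 4.0: the wire log category is "org.apache.http.wire". */
    static void enableFor4x() {
        useSimpleLog();
        System.setProperty(SIMPLELOG + "log.org.apache.http.wire", "debug");
        System.setProperty(SIMPLELOG + "log.org.apache.http", "debug");
    }
}
```

Note that the full wire log includes the response body, so it gets very
large for binary downloads such as PDF files.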

> 2> How do I know whether the file that I am fetching is a text file or 
> not?

By the Content-Type response header.
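
A minimal sketch of that check, assuming the header value has already
been read from the response as a plain string. ContentTypes, mediaType
and isText are illustrative names, not part of HttpClient. Treating only
text/* as text is a simplification: types such as application/xml are
textual as well.

```java
import java.util.Locale;

final class ContentTypes {
    /** Returns the lower-cased media type, e.g. "text/html" for "text/html; charset=UTF-8". */
    static String mediaType(String headerValue) {
        if (headerValue == null) return null;
        // parameters such as charset follow the first semicolon
        int semi = headerValue.indexOf(';');
        String type = (semi >= 0 ? headerValue.substring(0, semi) : headerValue).trim();
        return type.isEmpty() ? null : type.toLowerCase(Locale.ROOT);
    }

    /** True when the media type belongs to the text/* family. */
    static boolean isText(String headerValue) {
        String type = mediaType(headerValue);
        return type != null && type.startsWith("text/");
    }
}
```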

> For example, given that I do not know the type of the file I am fetching, 
> is there any way to tell from the Content-Type etc. what type of file I 
> have fetched?
> I tried the Content-Type header; it is the same for a normal HTML file, 
> a PDF file and also an image file.
> 

In this case the server-side code is broken.
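
When the header cannot be trusted, a crawler can fall back on the first
bytes of the body. The sketch below only knows the two formats mentioned
in this thread: PDF files start with "%PDF-" and PNG files start with the
eight-byte signature 89 50 4E 47 0D 0A 1A 0A. MagicBytes and sniff are
hypothetical names, and this is a workaround for a misbehaving server,
not something HttpClient does for you.

```java
final class MagicBytes {
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    /** Guesses a media type from the leading bytes of a body; returns null if unknown. */
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```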


> 3> Redirects - I have set followredirects = true. I have one URL that 
> redirects when accessed from Firefox, but does not when accessed with 
> HttpClient. The status code for some reason is 200 (OK). Does it have 
> to be 3XX for HttpClient to follow redirects? The HTML dump from 
> HttpClient is as follows:
> 
> <html>
> <head>
>     <script>
>         <!--
>         redirect_url="http://www.feedroom.com/";
>         window.location.replace(redirect_url);
>         -->
>     </script>
> </head>
> <body>
>     Redirecting...<br/>
>     This url is deprecated. If your browser doesn't immediately redirect 
> you to the new url, please click the link below:<br/>
>     <a href="http://www.feedroom.com/">http://www.feedroom.com/</a>
> </body>
> </html>
> 

HttpClient takes care of redirects automatically. If you do not want
that, you can always disable automatic redirect handling.
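
Note that automatic handling applies to HTTP-level redirects, that is,
3xx responses carrying a Location header. The page quoted above is a
200 response that redirects from a script, and HttpClient does not
execute JavaScript, so a crawler has to find the target itself. A minimal
sketch for this particular page, assuming the script always assigns a
quoted URL to a variable named redirect_url as in the dump;
ScriptRedirect and findTarget are hypothetical names, and other sites
will use other patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptRedirect {
    // Matches an assignment such as: redirect_url="http://www.feedroom.com/";
    private static final Pattern REDIRECT_URL =
            Pattern.compile("redirect_url\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Returns the URL assigned to redirect_url in the page source, or null if absent. */
    static String findTarget(String html) {
        if (html == null) return null;
        Matcher m = REDIRECT_URL.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

The returned URL can then be requested with a second GET.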

Hope this helps

Oleg


> Thanks in advance!
> Rutuja
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
> For additional commands, e-mail: httpclient-users-help@hc.apache.org
> 

