hc-httpclient-users mailing list archives

From Rutuja Joshi <Rutuja.Jo...@Sun.COM>
Subject Not able to download PDF and PNG files using Httpclient
Date Fri, 10 Apr 2009 22:55:09 GMT

I am working on a web crawler application and am using Apache HttpClient
for it. I have the following issues that I have not been able to resolve.
(This is my first post and I am not sure how much detail I should provide
when asking questions, so please pardon me.)

1> Whenever I try to download a PDF file using HttpClient, the file that
gets downloaded is approximately half the size of the one I download
using Firefox. The same happens with PNG files. Both Acrobat and the
image viewer reject the files as having an invalid format. It may be
related to compression or something similar, but how do I find out? I
read the response as an input stream, wrap it in a buffered stream, and
write it to a file, so basically I am just fetching the raw bytes from
the response. If needed, I will provide a detailed log (I have read about
the wire log but haven't tried it yet; if needed I'll try to produce one
and send it to you).
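
For reference, the copy step described above can be sketched roughly as
follows. This is a minimal sketch using only java.io, not HttpClient
itself; the names StreamCopy and copy are made up for illustration, and
the input stream is assumed to be the response body stream. It moves
bytes only and never goes through a Reader or a String, since character
decoding is one common way binary files get damaged.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper (not part of HttpClient): copies a stream byte for byte.
final class StreamCopy {
    private StreamCopy() {}

    // Copies everything from 'in' to 'out' and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read() may return fewer bytes than the buffer holds, so write only n bytes.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

Comparing the returned byte count with the Content-Length response
header (when the server sends one) should show whether bytes are being
lost during the copy or were never received.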

2> How do I know whether the file that I am fetching is a text file or
not? For example, given that I do not know the type of the file in
advance, is there any way to tell from the Content-Type header or
similar what type of file I have fetched?
I tried the Content-Type header, but it's the same for a normal HTML
file, a PDF file, and an image file.
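
One alternative to relying on the header is to look at the first few
bytes of the body. The sketch below is only an illustration, with a
made-up class name (MagicSniffer) and only the two formats mentioned
above: PDF files begin with the ASCII text "%PDF-" and PNG files begin
with a fixed 8-byte signature. Anything else is reported as "unknown".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: guesses a file type from its leading bytes.
final class MagicSniffer {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG =
            {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    private MagicSniffer() {}

    // 'head' is assumed to hold the first bytes of the downloaded body.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Running the same check on the files that Acrobat and the image viewer
reject would also show whether they still start with a valid signature.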

3> Redirects - I have set followredirects = true. I have one URL that
redirects when accessed from Firefox, but does not when accessed using
HttpClient. The status code for some reason is 200 (OK). Does it have to
be 3XX for HttpClient to follow the redirect? The HTML dump from
HttpClient is as follows:

    This url is deprecated. If your browser doesn't immediately redirect 
you to the new url, please click the link below:<br/>
    <a href="http://www.feedroom.com/">http://www.feedroom.com/</a>
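
If the redirect turns out to happen inside the page rather than through
a 3XX status, the crawler would have to pull the target out of the HTML
itself. The following is a rough sketch of that idea, not anything
HttpClient provides: the class name HtmlRedirect is made up, the meta
refresh tag is an assumption (it does not appear in the dump above, and
the pattern expects http-equiv before content), and regular expressions
are only an approximation of real HTML parsing.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds a redirect target inside a 200 (OK) HTML body.
final class HtmlRedirect {
    // Assumed form: <meta http-equiv="refresh" content="0; url=...">
    private static final Pattern META_REFRESH = Pattern.compile(
            "(?i)<meta[^>]*http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*"
            + "content\\s*=\\s*[\"']?\\s*\\d+\\s*;\\s*url\\s*=\\s*[\"']?([^\"'>\\s]+)");
    // Fallback: the first quoted href of an anchor tag.
    private static final Pattern ANCHOR = Pattern.compile(
            "(?i)<a\\s[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']");

    private HtmlRedirect() {}

    // Returns the meta refresh URL if present, else the first link, else empty.
    static Optional<String> findTarget(String html) {
        Matcher meta = META_REFRESH.matcher(html);
        if (meta.find()) return Optional.of(meta.group(1));
        Matcher anchor = ANCHOR.matcher(html);
        if (anchor.find()) return Optional.of(anchor.group(1));
        return Optional.empty();
    }
}
```

Applied to the dump above, the fallback would pick up the link to
http://www.feedroom.com/, which could then be fetched as a new request.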

Thanks in advance!

