hc-httpclient-users mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Downloading HTML frameset pages via HTTPClient
Date Mon, 24 Aug 2009 20:00:28 GMT
Hi Melroy,

On Aug 24, 2009, at 12:20pm, melroyr wrote:

>
> I have written a program to download html pages from harristeeter.
> However, when I run my program, I get the following
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
> "http://www.w3.org/TR/html4/frameset.dtd">
> <html>
> <head>
> <title>Your Personal Shopping List</title>
> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

[snip]

> </frameset>
> <frame src="actions.jsp" name="bottomFrame" scrolling="YES" noresize>
> </frameset>
>
> <noframes><body>
> This application requires the use of frames, which your browser does
> not support.
> </body></noframes>
>
> </html>
>
> The URL I am using to download the pages is
> http://flyer.harristeeter.com/HT_eVIC/ThisWeek/ReviewAllSpecials.jsp
>
> Please advise if there is some setting that I need to set in
> HttpClient? I have read about HtmlCleaner and stuff but I do not
> think they will help.

Well, first it would help to know what you think is the problem. The  
above page seems OK to me.

If I had to guess, the issue is that you want the content of the
frames themselves (e.g. the page behind the <frame src="xxx"> link),
not the frameset page that wraps them.

If so, then HttpClient can't automagically help you here. It fetches
the one URL you give it and doesn't parse the HTML that comes back.
Easiest approach would be to use a regex to extract the src="xxx"
links, convert them from relative to absolute, and fetch
again...similar to what a real web crawler might do.
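
A rough sketch of the extract-and-resolve step, using only the JDK. The
class and method names (FrameLinkExtractor, extractFrameUrls) are made
up for illustration, and the regex is a deliberately simple assumption
about what the frame tags look like (quoted src values, no spaces in the
URL); it is not a general HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
final class FrameLinkExtractor {

    // Matches <frame ... src="..."> with single or double quotes.
    // The \b after "frame" keeps <frameset> from matching.
    private static final Pattern FRAME_SRC = Pattern.compile(
            "<frame\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the absolute URL of every frame found in html,
    // resolved against the URL the frameset page was fetched from.
    static List<String> extractFrameUrls(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = FRAME_SRC.matcher(html);
        while (m.find()) {
            // URI.resolve handles both "actions.jsp" and "/actions.jsp".
            // It throws IllegalArgumentException on malformed values
            // (e.g. a src containing spaces); real code should catch that.
            urls.add(base.resolve(m.group(2).trim()).toString());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<frameset>"
                + "<frame src=\"actions.jsp\" name=\"bottomFrame\" scrolling=\"YES\" noresize>"
                + "</frameset>";
        String pageUrl =
                "http://flyer.harristeeter.com/HT_eVIC/ThisWeek/ReviewAllSpecials.jsp";

        for (String url : extractFrameUrls(html, pageUrl)) {
            // prints http://flyer.harristeeter.com/HT_eVIC/ThisWeek/actions.jsp
            System.out.println(url);
        }
    }
}
```

Each URL that comes back would then go through a second GET with
HttpClient, the same way the frameset page was fetched. If a frame page
is itself a frameset, the same step would need to be repeated on it.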

-- Ken



