hc-httpclient-users mailing list archives

From "Graeme" <coolki...@hotmail.com>
Subject How can I find certain words in a html page?
Date Wed, 12 Oct 2005 22:54:50 GMT
I am going to be using HttpClient to get the source of a web page, and I am
hoping to be able to extract certain information from that page. It will
all be HTML, and I am looking for the information between these tags:

//... HTML Stuff here

	<td class="alt1">(Simple 2 digit number I need here)</td>
</tr><tr align="center">
//... More HTML Stuff after this as well

	<td class="alt1">(Simple 2 digit number I need here)</td>
</tr><tr align="center">
//... HTML Stuff after this as well

I think I am going to have to search through the result of
method.getResponseBody() for text that begins with </td> <td class="alt1">
and ends with </tr><tr align="center">, and take the data between them.
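
One way to do the search described above is a regular expression over the page text. This is only a sketch under assumptions: the class name Alt1Extractor and the pattern are made up for illustration, and the pattern assumes the number is the only content of a <td class="alt1"> cell and is one or two digits long. Cells with the same class that hold other text are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
class Alt1Extractor {

    // Assumed shape of the cell: <td class="alt1">NN</td>, optional whitespace around NN.
    private static final Pattern CELL =
            Pattern.compile("<td class=\"alt1\">\\s*(\\d{1,2})\\s*</td>");

    // Returns every one- or two-digit value found in a matching cell, in page order.
    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<tr>\n\t<td class=\"alt1\">42</td>\n</tr><tr align=\"center\">\n"
                + "\t<td class=\"alt1\">17</td>\n</tr><tr align=\"center\">";
        System.out.println(extract(html)); // prints [42, 17]
    }
}
```

With the posted program, the input would be the String built from the response body. A regular expression is fragile against markup changes, so a real HTML parser may be a better fit if the page layout varies.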

Am I right in thinking that I can't search through it a line at a time, and
that I have to wait until the entire source comes in and then search through
one massive string?
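
On the line-at-a-time question: HttpClient 3.x also exposes the body as a stream through method.getResponseBodyAsStream(), so the page can be read line by line instead of as one large string. The sketch below is an illustration under assumptions, not a definitive implementation: LineScanner and its pattern are hypothetical, a ByteArrayInputStream stands in for the response stream so the example runs without a network, the charset is assumed to be ISO-8859-1, and each cell is assumed to sit entirely on one line, as in the sample above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
class LineScanner {

    // Assumed shape of the cell: <td class="alt1">NN</td> on a single line.
    private static final Pattern CELL =
            Pattern.compile("<td class=\"alt1\">\\s*(\\d{1,2})\\s*</td>");

    // Reads the stream one line at a time and collects the matching values.
    static List<String> scan(InputStream in) throws IOException {
        List<String> values = new ArrayList<>();
        // Assumption: the page is ISO-8859-1; use the charset the server reports if it differs.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = CELL.matcher(line);
                while (m.find()) {
                    values.add(m.group(1));
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream returned by method.getResponseBodyAsStream().
        byte[] page = ("<tr>\n\t<td class=\"alt1\">42</td>\n</tr><tr align=\"center\">\n"
                + "\t<td class=\"alt1\">17</td>\n").getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(scan(new ByteArrayInputStream(page))); // prints [42, 17]
    }
}
```

Reading by line keeps memory use small, but it misses any cell whose opening and closing tags fall on different lines.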

Anyway, once I have the data I want to put it into a text file, which I can
already do.
Here's the code so far:

import java.io.*;
import java.net.*;
import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class HttpClientTutorial {

  private static String url = "http://www.youngcoders.com/memberlist.php";

  public static void main(String[] args) {
    // Create an instance of HttpClient.
    HttpClient client = new HttpClient();

    // Create a method instance.
    GetMethod method = new GetMethod(url);

    // Provide a custom retry handler if necessary.
    method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
    		new DefaultHttpMethodRetryHandler(3, false));

    try {
      // Execute the method.
      int statusCode = client.executeMethod(method);

      if (statusCode != HttpStatus.SC_OK) {
        System.err.println("Method failed: " + method.getStatusLine());
      }

      // Read the response body.
      byte[] responseBody = method.getResponseBody();

      // Deal with the response.
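      // Use caution: ensure correct character encoding and that the body is not binary.

      File outFile = new File("age.html");  // output file name

      BufferedWriter writer = new BufferedWriter(new FileWriter(outFile));

      // Decodes the bytes with the platform default charset.
      String line = new String(responseBody);

      // Write the whole page to the file, then close the writer.
      writer.write(line);
      writer.close();
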
    } catch (HttpException e) {
      System.err.println("Fatal protocol violation: " + e.getMessage());
    } catch (IOException e) {
      System.err.println("Fatal transport error: " + e.getMessage());
    } finally {
      // Release the connection.
      method.releaseConnection();
    }
  }
}

At the moment that just gets the entire web page and puts it in a .html file,
but how do I get just certain bits from the page?
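
To save only the extracted bits rather than the whole page, the values can be written one per line to a text file. A minimal sketch, assuming the values have already been collected into a list; ValueWriter and the file name ages.txt are hypothetical, and UTF-8 is an arbitrary choice of output encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HttpClient.
class ValueWriter {

    // Writes each value on its own line, replacing the file if it already exists.
    static void writeValues(List<String> values, Path target) throws IOException {
        Files.write(target, values, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values standing in for numbers pulled from the page.
        writeValues(Arrays.asList("42", "17"), Paths.get("ages.txt"));
    }
}
```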

Thanks for your time, and if you don't understand anything just tell me and
I'll try to explain better.
