lucene-java-user mailing list archives

From David Black <bl...@apple.com>
Subject Re: HTML parser
Date Sun, 21 Apr 2002 21:49:11 GMT
I think I have found the secret recipe for doing this:

1. The example at Sun for link extraction. This was very easy to convert 
over to my application.
http://developer.java.sun.com/developer/TechTips/1999/tt0923.html


2. Brian Goetz's (great) Library at
http://www.quiotix.com/opensource/html-parser
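
For the link-extraction side (item 1), here is a minimal sketch that uses only the HTML parser bundled with the JDK (javax.swing.text.html). It is not necessarily the code from the Sun tip, and the class and method names (LinkExtractor, extractLinks) are made up for illustration. It collects the href attribute of every <a> start tag:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any library mentioned in this thread.
final class LinkExtractor {

    // Returns the href values of all <a> tags, in document order.
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

The hrefs come back exactly as written in the page, so relative links still have to be resolved against the page's URL (for example with new URL(base, href)) before they can be fetched.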


While the "Visitor Design Pattern" might make your eyes cross at first, 
it's actually pretty cool.  Here's a simple Visitor class that I wrote 
to extract the text from the HTML.  It refers to a helper for searching 
and replacing strings, in a class called StripperUtils.java.  
If I understood the Visitor thing better, I could probably produce 
something more elegant, like converting each "&..;" entity to its 
appropriate unencoded text.



---------------begin HTMLTextVisitor.java --------

import com.quiotix.html.parser.*;
import java.io.*;

// Visitor that writes only the text nodes of a parsed HTML document.
public class HTMLTextVisitor extends HtmlVisitor {
     protected PrintWriter out;

     public HTMLTextVisitor(OutputStream os) {
         out = new PrintWriter(os);
     }

     public HTMLTextVisitor(OutputStream os, String encoding)
             throws UnsupportedEncodingException {
         out = new PrintWriter(new OutputStreamWriter(os, encoding));
     }

     // Call once after the document has been visited.
     public void finish() {
         out.flush();
     }

     public void visit(HtmlDocument.Text t) {
         String txt = t.toString();
         txt = StripperUtils.replace(txt, "&nbsp;", " ");
         // For some weird reason, the first pass doesn't get all of them.
         txt = StripperUtils.replace(txt, "&nbsp;", " ");
         out.print(txt);
     }
}

---------- end HTMLTextVisitor-------


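--------- begin StripperUtils.java ----------

public class StripperUtils {

     // Replaces every occurrence of subStringToFind in originalText.
     public static String replace(String originalText,
                                  String subStringToFind,
                                  String subStringToReplaceWith) {
         // An empty search string would never advance; return the input as is.
         if (subStringToFind.length() == 0) {
             return originalText;
         }
         int s = 0;  // start of the not-yet-copied part
         int e = 0;  // index of the next match
         StringBuffer newText = new StringBuffer();
         while ((e = originalText.indexOf(subStringToFind, s)) >= 0) {
             newText.append(originalText.substring(s, e));
             newText.append(subStringToReplaceWith);
             s = e + subStringToFind.length();
         }
         newText.append(originalText.substring(s));
         return newText.toString();
     }
}

--------- End StripperUtils.java --------------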
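
As for the "&..;" conversion mentioned above, here is one rough way it might be done in a single pass instead of one replace() call per entity. This is only a sketch: the class name EntityDecoder and its tiny entity table are invented for illustration (a real table would cover the full HTML entity list), &nbsp; is mapped to a plain space to match the visitor above, and it uses generics and StringBuilder, so it needs Java 5 or later.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the entity table is deliberately incomplete.
final class EntityDecoder {
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("nbsp", " ");   // plain space, as in HTMLTextVisitor
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    // Replaces known named entities and numeric references; leaves the rest alone.
    static String decode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            String replacement = null;
            // Only look at short "&name;" spans so a stray '&' is not
            // paired with a distant ';' later in the text.
            if (semi > i + 1 && semi - i <= 10) {
                replacement = lookup(text.substring(i + 1, semi));
            }
            if (replacement != null) {
                out.append(replacement);
                i = semi + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Returns the decoded text for an entity name, or null if unknown.
    private static String lookup(String name) {
        if (name.charAt(0) == '#') {
            try {
                boolean hex = name.length() > 1
                        && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
                int code = hex ? Integer.parseInt(name.substring(2), 16)
                               : Integer.parseInt(name.substring(1));
                return new String(Character.toChars(code));
            } catch (IllegalArgumentException e) {
                return null;  // malformed number or invalid code point
            }
        }
        return NAMED.get(name);
    }
}
```

In the visitor, the two replace() calls could then become a single txt = EntityDecoder.decode(txt). Anything not recognised is passed through untouched.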

On Saturday, April 20, 2002, at 09:29 AM, lucene@libero.it wrote:

> Hi all,
>
> I'm very interested in this thread. I also have to solve the problem
> of spidering web sites: creating the index (well, about this there is
> the BIG problem that lucene can't be integrated easily with a DB),
> extracting links from each page, and repeating the whole process.
>
> For extracting links from a page I'm thinking of using JTidy. I think
> that with this library you can also parse a non-well-formed page (that
> you can take from the web with URLConnection) by setting the property
> to clean the page. The Tidy class returns an org.w3c.dom.Document that
> you can use for analyzing the whole document: for example you can use
> doc.getElementsByTagName("a") to get all the a elements. You can
> parse it as XML.
>
> Has anyone solved the problem of recursively spidering web pages?
>
> Laura
>
>
>
>
>>
>>> While trying to research the same thing, I found the following...here's a
>>> good example of link extraction.....
>>
>> Try http://www.quiotix.com/opensource/html-parser
>>
>> It's easy to write a Visitor which extracts the links; should take about ten
>> lines of code.
>>
>>
>>
>> --
>> Brian Goetz
>> Quiotix Corporation
>> brian@quiotix.com           Tel: 650-843-1300            Fax: 650-324-8032
>>
>> http://www.quiotix.com
