lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ross Rankin" <r...@commercescience.com>
Subject HTMLParser
Date Thu, 13 Jul 2006 20:33:44 GMT
Since I cannot seem to access the HTMLParser mailing list and I saw the
library recommended here, I thought someone here that has used it
successfully can help me out.  

I have HTML text stored in a database field which I want to add to a
Lucene document, but I want to remove the HTML tags, so HTMLParser
looked like it would fit the bill.

 

First, it does not seem to be parsing… hence my first problem and it
also is throwing an exception along with this phrase sprinkled around
“(No such file or directory)”.  

 

I think I may be using it wrong, so here’s what I have done.  In my
object where I create my document, I have the following code:

        StringExtractor extract = new
StringExtractor(record.get("column14").toString().trim());

        try {

            value = extract.extractStrings(false);

        } catch (ParserException pe) {

            System.out.println("Index Long Description Parser
Exception:" + pe.getMessage() );

            value = "";

        }

 

What I get out in value is like the following:

<LI><FONT size=2>Crystal Clear III and 3D combfilter for natural, sharp
images with enhanced quality </FONT>

<LI><FONT size=2>Compact and sleek design </FONT>

<LI><FONT size=2>Incredible Surround (No such file or directory)

 

So the tags are still there and oddly the ‘(No such file or directory)’
phrase is added which is not in the original text.

 

Then I get a ParserException.

 

What am I doing wrong?

 

Thanks,

Ross



Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message