lucene-java-user mailing list archives

From "Yonik Seeley" <ysee...@gmail.com>
Subject Re: HTMLParser
Date Thu, 13 Jul 2006 21:34:27 GMT
I've never used HTMLParser, but if you have malformed, incomplete, or
optional HTML that would otherwise choke an HTML parser, you could use
Solr's HTMLStripping:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e

It's pretty stand-alone, so it should be trivial to rip it out of Solr
and re-use it in your Lucene project.
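
To illustrate the general idea of tolerant tag stripping, here is a
minimal standalone sketch in plain Java. It is not Solr's
implementation: the class name SimpleHtmlStripper and its strip()
method are hypothetical, and it only handles the simple cases (tags,
a truncated trailing tag, a handful of entities). It makes no attempt
to skip script or style bodies.

```java
/**
 * Hypothetical, simplified tag stripper for illustration only.
 * It never throws on malformed or truncated markup.
 */
final class SimpleHtmlStripper {
    private SimpleHtmlStripper() {}

    public static String strip(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int n = html.length();
        int i = 0;
        while (i < n) {
            char c = html.charAt(i);
            if (c == '<' && i + 1 < n && isTagStart(html.charAt(i + 1))) {
                int close = html.indexOf('>', i + 1);
                // Unterminated tag (truncated input): drop the remainder.
                i = (close < 0) ? n : close + 1;
                // Replace the tag with a space so adjacent words stay separate.
                out.append(' ');
            } else {
                out.append(c);
                i++;
            }
        }
        // Decode a few common entities; "&amp;" goes last to avoid double decoding.
        String text = out.toString()
                .replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        // Collapse runs of whitespace left behind by removed tags.
        return text.replaceAll("\\s+", " ").trim();
    }

    // A '<' only opens a tag when followed by a letter, '/', '!' or '?',
    // so plain text such as "a < b" is left alone.
    private static boolean isTagStart(char c) {
        return Character.isLetter(c) || c == '/' || c == '!' || c == '?';
    }

    public static void main(String[] args) {
        String html = "<LI><FONT size=2>Compact and sleek design </FONT>\n"
                + "<LI><FONT size=2>Incredible Surround";
        System.out.println(strip(html));
    }
}
```

The returned string can then be added to a Lucene document field as
usual. For real-world HTML, the Solr code linked above is likely the
more robust choice.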

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


On 7/13/06, Ross Rankin <ross@commercescience.com> wrote:
> Since I cannot seem to access the HTMLParser mailing list and I saw the
> library recommended here, I thought someone here that has used it
> successfully can help me out.
>
> I have HTML text stored in a database field which I want to add to a
> Lucene document, but I want to remove the HTML tags, so HTMLParser
> looked like it would fit the bill.
>
>
>
> First, it does not seem to be parsing… hence my first problem and it
> also is throwing an exception along with this phrase sprinkled around
> "(No such file or directory)".
>
>
>
> I think I may be using it wrong, so here's what I have done.  In my
> object where I create my document, I have the following code:
>
>         StringExtractor extract = new StringExtractor(record.get("column14").toString().trim());
>         try {
>             value = extract.extractStrings(false);
>         } catch (ParserException pe) {
>             System.out.println("Index Long Description Parser Exception:" + pe.getMessage());
>             value = "";
>         }
>
>
>
> What I get out in value is like the following:
>
> <LI><FONT size=2>Crystal Clear III and 3D combfilter for natural, sharp
> images with enhanced quality </FONT>
>
> <LI><FONT size=2>Compact and sleek design </FONT>
>
> <LI><FONT size=2>Incredible Surround (No such file or directory)
>
>
>
> So the tags are still there and oddly the '(No such file or directory)'
> phrase is added which is not in the original text.
>
>
>
> Then I get a ParserException.
>
>
>
> What am I doing wrong?
>
>
>
> Thanks,
>
> Ross

