lucene-java-user mailing list archives

From: Grant Ingersoll
Subject: Re: Index html sites using IndexHtml
Date: Mon, 27 Jul 2009 16:38:19 GMT

On Jul 26, 2009, at 7:24 AM, starz10de wrote:

> Hi,
> I am indexing a set of HTML websites using Lucene (IndexHtml). The indexer
> works fine and I can find the indexed terms, but the problem is that this
> class (IndexHtml) indexes all the text in the HTML page, even the
> advertisements. I am interested only in the body text, not in the
> advertisements or the side-link text.
> Any help on how to solve this problem? Did I use the class wrongly?

No, you didn't do anything wrong. That class has no way to separate
body text from advertisements or side links (in fact, it's a pretty
basic bit of demo code). You might look into one of the more robust
HTML parsing libraries out there.
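
A minimal sketch of the kind of filtering asked about, assuming the
advertisements and side links sit inside <div> elements whose class
attribute marks them as such. BodyTextExtractor and the class names
"ad", "sidebar" and "nav" are hypothetical, chosen only for
illustration; they are not part of Lucene or its demo code. The sketch
uses the HTML parser bundled with the JDK so that it has no external
dependencies:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper (not part of Lucene): collects page text, skipping marked divs. */
final class BodyTextExtractor {

    // Assumption: these class names mark boilerplate on the sites being indexed.
    private static final Set<String> SKIP_CLASSES =
            new HashSet<>(Arrays.asList("ad", "sidebar", "nav"));

    /** Returns the text of the page, omitting text inside skipped divs. */
    static String extract(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int skipDepth = 0; // > 0 while inside a skipped <div>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.DIV) {
                    return;
                }
                if (skipDepth > 0) {
                    skipDepth++;            // nested div inside a skipped region
                } else if (hasSkipClass(attrs)) {
                    skipDepth = 1;          // entering a skipped region
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.DIV && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (skipDepth == 0) {
                    out.append(data).append(' ');
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    // True if any whitespace-separated class name is in SKIP_CLASSES.
    private static boolean hasSkipClass(MutableAttributeSet attrs) {
        Object value = attrs.getAttribute(HTML.Attribute.CLASS);
        if (value == null) {
            return false;
        }
        for (String name : value.toString().trim().split("\\s+")) {
            if (SKIP_CLASSES.contains(name)) {
                return true;
            }
        }
        return false;
    }
}
```

The string returned by BodyTextExtractor.extract(html) could then be
indexed in place of the full page text. Real sites mark up their
advertisements in many different ways, so the skip rule would need
adjusting per site, and a dedicated HTML parsing library will likely
cope with messy markup better than the JDK's parser.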


Grant Ingersoll
