lucene-java-user mailing list archives

From starz10de
Subject Index HTML sites using IndexHtml
Date Sun, 26 Jul 2009 11:24:26 GMT


I am indexing a set of HTML websites with Lucene (IndexHtml). The indexer
works fine and I can find the indexed terms, but the problem is that this
class (IndexHtml) indexes all of the text in each HTML page, including the
advertisements. I am only interested in the body text, not in the
advertisements or the text of the side links.

Any help on how to solve this problem? Am I using the class incorrectly?
