incubator-lucy-user mailing list archives

From Grant McLean <>
Subject [lucy-user] Indexing HTML documents
Date Mon, 11 Jul 2011 03:28:23 GMT
Hi all

I'm just getting started with trying out Lucy. Installation went without
a hitch and I've successfully worked my way through the tutorials.
Congratulations on getting the project to this level of quality.

My main interest is indexing HTML documents for web sites.  It seems
that if I feed the HTML file contents to the Lucy indexer, all the
markup (tags and attributes) ends up in the index and consequently comes
back out in the highlighted excerpts. Is it my responsibility to strip
the tags out before passing the text to the indexer? Or is there a
simple option I can enable somewhere to have this happen automatically?
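
To make the first option concrete, here is a minimal sketch of stripping the
markup before the text reaches the indexer. It is written in Python with only
the standard library and is not Lucy code. The names TextExtractor and
strip_html are made up for illustration, and dropping the contents of
<script> and <style> is an assumption about what should stay out of the index.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    # Assumption: these elements never hold text worth indexing.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so words in adjacent elements do not merge,
        # then collapse every run of whitespace to a single space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The string returned by strip_html would then be passed to the indexer in
place of the raw file contents, so tags and attributes never reach the index
or the excerpts. One trade-off of joining chunks with a space is that a word
split by inline markup (for example H<b>ello</b>) comes out as two tokens.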

