lucy-user mailing list archives

From: Marvin Humphrey
Subject: Re: [lucy-user] Indexing HTML documents
Date: Mon, 11 Jul 2011 05:47:53 GMT

On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
> I'm just getting started with trying out Lucy. Installation went without
> a hitch and I've successfully worked my way through the tutorials.


> Congratulations on getting the project to this level of quality.

Thanks!  :)

> My main interest is indexing HTML documents for web sites.  It seems
> that if I feed the HTML file contents to the Lucy indexer, all the
> markup (tags and attributes) ends up in the index and consequently comes
> back out in the highlighted excerpts. Is it my responsibility to strip
> the tags out before passing the text to the indexer?

You have to handle document parsing yourself and supply plain text to Lucy.

Lucy is a specialized full-text indexing library rather than a turnkey indexing
solution, so it does not bundle file-format-specific parsing tools.  Instead,
it is designed so that it may serve as the indexing component within a larger
system which aggregates additional components such as parsers.
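
As a rough illustration of that division of labor, here is a minimal sketch of
a tag-stripping step that runs before indexing. It is written in Python using
only the standard library's html.parser module, purely to show the idea. The
names TextExtractor and html_to_text are hypothetical, nothing here is part of
Lucy, and a real site would likely want a more robust parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, dropping tags, attributes,
    and the bodies of <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so words in adjacent elements do not
        # fuse together, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the plain-text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = '<p class="intro">Hello &amp; <b>world</b></p>'
    print(html_to_text(sample))
```

The string returned by html_to_text is what would be handed to the indexer in
place of the raw HTML, so that markup appears neither in the index nor in the
highlighted excerpts. Joining chunks with a space is a deliberate trade-off:
it keeps text from neighboring elements apart, at the cost of splitting a word
that has inline markup in the middle of it.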

At this point I would ordinarily suggest a variety of HTML parsing CPAN
distributions, but presuming that you are the Grant McLean who maintains
XML::Simple and XML::SAX, I imagine that you are familiar with the lay of the
land.  :)

Marvin Humphrey
