Date: Sun, 10 Jul 2011 22:47:53 -0700
From: Marvin Humphrey <marvin@rectangular.com>
To: lucy-user@incubator.apache.org
Message-ID: <20110711054753.GA11852@rectangular.com>
References: <1310354903.5094.12.camel@putnam>
In-Reply-To: <1310354903.5094.12.camel@putnam>
User-Agent: Mutt/1.5.18 (2008-05-17)
Subject: Re: [lucy-user] Indexing HTML documents

On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
> I'm just getting started with trying out Lucy. Installation went without
> a hitch and I've successfully worked my way through the tutorials.

Nice...

> Congratulations on getting the project to this level of quality.

Thanks!  :)

> My main interest is indexing HTML documents for web sites.  It seems
> that if I feed the HTML file contents to the Lucy indexer, all the
> markup (tags and attributes) ends up in the index and consequently comes
> back out in the highlighted excerpts. Is it my responsibility to strip
> the tags out before passing the text to the indexer?

Yes.  You have to handle document parsing yourself, strip the markup, and
supply plain text to Lucy.

Lucy is a specialized fulltext indexing library rather than a turnkey indexing
solution, so it does not bundle file-format-specific parsing tools.  Instead,
it is designed so that it may serve as the indexing component within a larger
system which aggregates additional components such as parsers.
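
To make the division of labor concrete, here is a minimal sketch of the
"parser" half of such a system.  It is written in Python using only the
standard library's html.parser module, purely for illustration; the class
name TextExtractor and the function html_to_text are hypothetical names, not
part of Lucy.  The assumption is that you run something like this over each
HTML file and hand the resulting plain text to the indexer as a field value.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping tags and script/style bodies."""

    # Elements whose contents should never reach the index.
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the text of an HTML string with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Joining chunks with a space keeps text from adjacent block elements from
running together, at the cost of splitting a word that has inline markup in
the middle of it.  A real parser component would probably treat block and
inline elements differently, and might pull the title out into its own field.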

At this point I would ordinarily suggest a variety of HTML parsing CPAN
distributions, but presuming that you are the Grant McLean who maintains
XML::Simple and XML::SAX, I imagine that you are familiar with the lay of the
land.  :)

Marvin Humphrey