Date: Sun, 10 Jul 2011 22:47:53 -0700
From: Marvin Humphrey
To: lucy-user@incubator.apache.org
Message-ID: <20110711054753.GA11852@rectangular.com>
In-Reply-To: <1310354903.5094.12.camel@putnam>
Subject: Re: [lucy-user] Indexing HTML documents

On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
> I'm just getting started with trying out Lucy.  Installation went without
> a hitch and I've successfully worked my way through the tutorials.

Nice...

> Congratulations on getting the project to this level of quality.

Thanks! :)

> My main interest is indexing HTML documents for web sites.  It seems
> that if I feed the HTML file contents to the Lucy indexer, all the
> markup (tags and attributes) ends up in the index and consequently comes
> back out in the highlighted excerpts.  Is it my responsibility to strip
> the tags out before passing the text to the indexer?

You have to handle document parsing yourself and supply plain text to Lucy.

Lucy is a specialized fulltext indexing library rather than a turnkey
indexing solution, so it does not bundle file-format-specific parsing tools.
Instead, it is designed so that it may serve as the indexing component within
a larger system which aggregates additional components such as parsers.

At this point I would ordinarily suggest a variety of HTML parsing CPAN
distributions, but presuming that you are the Grant McLean who maintains
XML::Simple and XML::SAX, I imagine that you are familiar with the lay of the
land. :)

Marvin Humphrey
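
For readers of the archive, the "parse it yourself, then hand over plain text" step can be sketched roughly as below. This is only an illustration under stated assumptions: it uses Python's standard-library html.parser rather than any of the CPAN distributions alluded to above, the names TextExtractor and html_to_text are made up for this sketch, and it does not call Lucy at all. The string it returns is what one would then pass to the indexer as a field value.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text content, dropping all markup."""

    # Elements whose contents are code or styling, not document text.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tags and attributes are never recorded; only track skipped elements.
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the plain text of an HTML string (assumed helper name)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Join fragments with a space so text from adjacent elements does not
    # run together, then collapse the whitespace left by removed markup.
    return " ".join(" ".join(parser.parts).split())
```

Joining fragments with a space favours keeping words from neighbouring block elements apart, at the cost of splitting a word that is only partly wrapped in an inline tag; a real parser pipeline would probably want to be smarter about that.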