From: Peter Karman
Reply-To: peter@peknet.com
Date: Tue, 12 Jul 2011 06:36:14 -0500
To: lucy-user@incubator.apache.org
CC: Grant McLean
Message-ID: <4E1C31AE.6090401@peknet.com>
In-Reply-To: <1310354903.5094.12.camel@putnam>
Subject: Re: [lucy-user] Indexing HTML documents

Grant McLean wrote on 7/10/11 10:28 PM:
> Hi all
>
> I'm just getting started with trying out Lucy. Installation went without
> a hitch and I've successfully worked my way through the tutorials.
> Congratulations on getting the project to this level of quality.
>
> My main interest is indexing HTML documents for web sites. It seems
> that if I feed the HTML file contents to the Lucy indexer, all the
> markup (tags and attributes) ends up in the index and consequently comes
> back out in the highlighted excerpts. Is it my responsibility to strip
> the tags out before passing the text to the indexer? Or is there a
> simple option I can enable somewhere to have this happen automatically?
>

Consider using Swish3 with the Lucy backend.

http://search.cpan.org/dist/SWISH-Prog-Lucy/

If you install SWISH::Prog::Lucy you'll get the swish3 cli with which you can
easily index .html, .xml, .pdf, .doc, .xls, .txt, etc.

Example:

index docs:

 % swish3 -F lucy -i path/to/html/files

search docs:

 % swish3 -q 'some query'

Since the index created is a standard Lucy index, you can search it with the
relevant Lucy classes, or use the SWISH::Prog::Lucy::Searcher wrapper (which
automatically refreshes the index handle when the index is updated).

See also the new Dezi REST server if you want to put a web service in front of
your Lucy index, like Solr:

http://search.cpan.org/dist/Dezi

Docs are still a bit sparse; get in touch if you're interested in helping flesh
them out.

-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
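
The two swish3 commands in the message above could be chained in a small shell helper, roughly as sketched below. This is only an illustration: the function name index_and_search is hypothetical and not part of swish3, and the sketch assumes, as the example in the message implies, that the search step finds the index the indexing step just wrote without being given a path. Only the flags shown in the message (-F, -i, -q) are used.

```shell
#!/bin/sh
# Sketch only: index_and_search is a hypothetical helper name, not part of swish3.
# It runs the two swish3 invocations from the message above, in order.
index_and_search() {
    docs_dir=$1   # directory of documents to index
    query=$2      # query string to run once indexing has finished

    # Fail early with a hint if the swish3 command is not available.
    if ! command -v swish3 >/dev/null 2>&1; then
        echo "swish3 not found; install SWISH::Prog::Lucy first" >&2
        return 127
    fi

    # Build the index with the Lucy backend; stop here if indexing fails.
    swish3 -F lucy -i "$docs_dir" || return 1

    # Search the index (assumed to be picked up from swish3's default location).
    swish3 -q "$query"
}

# Example call (path and query are placeholders):
# index_and_search path/to/html/files 'some query'
```

Stopping when the indexing step fails keeps the search from running against a missing or stale index.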