Mailing-List: contact lucy-user-help@incubator.apache.org; run by ezmlm
Reply-To: lucy-user@incubator.apache.org
Message-ID: <4E1C31AE.6090401@peknet.com>
Date: Tue, 12 Jul 2011 06:36:14 -0500
From: Peter Karman <peter@peknet.com>
Reply-To: peter@peknet.com
To: lucy-user@incubator.apache.org
CC: Grant McLean <grant@catalyst.net.nz>
References: <1310354903.5094.12.camel@putnam>
In-Reply-To: <1310354903.5094.12.camel@putnam>
Subject: Re: [lucy-user] Indexing HTML documents

Grant McLean wrote on 7/10/11 10:28 PM:
> Hi all
> 
> I'm just getting started with trying out Lucy. Installation went without
> a hitch and I've successfully worked my way through the tutorials.
> Congratulations on getting the project to this level of quality.
> 
> My main interest is indexing HTML documents for web sites.  It seems
> that if I feed the HTML file contents to the Lucy indexer, all the
> markup (tags and attributes) ends up in the index and consequently comes
> back out in the highlighted excerpts. Is it my responsibility to strip
> the tags out before passing the text to the indexer? Or is there a
> simple option I can enable somewhere to have this happen automatically?
> 

Consider using Swish3 with the Lucy backend.

http://search.cpan.org/dist/SWISH-Prog-Lucy/

If you install SWISH::Prog::Lucy, you'll get the swish3 command-line tool, which
can index .html, .xml, .pdf, .doc, .xls, .txt, and other formats.

Example:

index docs:
 % swish3 -F lucy -i path/to/html/files

search docs:
 % swish3 -q 'some query'
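
If you end up reindexing regularly, the two commands above can be wrapped in a
small shell script. This is only a sketch: the function names and the SWISH3
variable are made up for illustration, and it assumes swish3 is on your PATH and
that searching reads the same default index that indexing wrote, as in the
commands above. Only the -F, -i, and -q flags come from the example.

```shell
#!/bin/sh
# Sketch only: thin wrappers around the two swish3 commands shown above.
# SWISH3 is a hypothetical override hook (not a swish3 feature) so the
# command can be swapped out, e.g. for a dry run.
SWISH3="${SWISH3:-swish3}"

# index_docs DIR: index the files under DIR with the Lucy backend.
index_docs() {
    "$SWISH3" -F lucy -i "$1"
}

# search_docs QUERY: run QUERY, passed as a single argument.
search_docs() {
    "$SWISH3" -q "$1"
}
```

After sourcing the file, usage would look something like:

 % index_docs path/to/html/files && search_docs 'some query'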

Since the index created is a standard Lucy index, you can search it with the
relevant Lucy classes, or use the SWISH::Prog::Lucy::Searcher wrapper (which
automatically refreshes the index handle when the index is updated).

See also the new Dezi REST server if you want to put a Solr-like web service in
front of your Lucy index:

 http://search.cpan.org/dist/Dezi

Docs are still a bit sparse; get in touch if you're interested in helping flesh
them out.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com