lucy-dev mailing list archives

From Peter Karman <pe...@peknet.com>
Subject [lucy-dev] Re: [lucy-user] Indexing HTML documents
Date Tue, 12 Jul 2011 02:17:48 GMT


Grant McLean wrote on 7/10/11 10:28 PM:
> Hi all
> 
> I'm just getting started with trying out Lucy. Installation went without
> a hitch and I've successfully worked my way through the tutorials.
> Congratulations on getting the project to this level of quality.
> 
> My main interest is indexing HTML documents for web sites.  It seems
> that if I feed the HTML file contents to the Lucy indexer, all the
> markup (tags and attributes) ends up in the index and consequently comes
> back out in the highlighted excerpts. Is it my responsibility to strip
> the tags out before passing the text to the indexer? Or is there a
> simple option I can enable somewhere to have this happen automatically?
> 

Consider using Swish3 with the Lucy backend.

http://search.cpan.org/dist/SWISH-Prog-Lucy/

If you install SWISH::Prog::Lucy you'll get the swish3 CLI, with which you can
easily index .html, .xml, .pdf, .doc, .xls, .txt, etc.

Example:

index docs:
 % swish3 -F lucy -i path/to/html/files

search docs:
 % swish3 -q 'some query'

Since the index created is a standard Lucy index, you can search it with the
relevant Lucy classes, or use the SWISH::Prog::Lucy::Searcher wrapper (which
automatically refreshes the index handle when the index is updated).
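
A minimal sketch of the first option, searching the index directly with
Lucy::Search::IndexSearcher. The index path is whatever swish3 wrote for you,
and the field names swishtitle and swishdocpath are assumptions about what the
indexer stores, so check them against your own index before relying on this.
The format_hit helper is just an illustrative name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Format one hit for display. The field names 'swishtitle' and
# 'swishdocpath' are assumptions; adjust them to match your index schema.
sub format_hit {
    my ( $fields, $score ) = @_;
    my $title = defined $fields->{swishtitle}   ? $fields->{swishtitle}   : '(no title)';
    my $path  = defined $fields->{swishdocpath} ? $fields->{swishdocpath} : '(no path)';
    return sprintf( "%.3f  %s  [%s]", $score, $title, $path );
}

sub main {
    my ( $index_path, $query ) = @_;

    # Loaded at run time so format_hit can be tried without Lucy installed.
    require Lucy::Search::IndexSearcher;

    my $searcher = Lucy::Search::IndexSearcher->new( index => $index_path );

    # A plain query string is parsed against the fields in the index schema.
    my $hits = $searcher->hits(
        query      => $query,
        offset     => 0,
        num_wanted => 10,
    );

    printf "%d total hits\n", $hits->total_hits;
    while ( my $hit = $hits->next ) {
        print format_hit( $hit->get_fields, $hit->get_score ), "\n";
    }
}

if ( @ARGV == 2 ) {
    main(@ARGV);
}
else {
    warn "usage: $0 path/to/index 'some query'\n";
}
```

Run it with the index directory and a query string as the two arguments. Unlike
the SWISH::Prog::Lucy::Searcher wrapper, a plain IndexSearcher keeps reading the
snapshot it opened, so you would need to create a new one to see later updates.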

See also the new Dezi REST server if you want to put a Solr-like web service in
front of your Lucy index:

 http://search.cpan.org/dist/Dezi

Docs are still a bit sparse; get in touch if you're interested in helping flesh
them out.



-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
