lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Re: indexing/searching a website
Date Thu, 27 Nov 2003 10:36:48 GMT
On Thursday, November 27, 2003, at 03:36  AM, Michal S wrote:
> Dear Group Members,
> I have looked in the archives for a simple tutorial which could guide me 
> through the process of integrating Lucene with a website based on Struts.
> The website uses tiles, the content of the tiles is kept in multiple 
> jsp files.
> I have read several Marco's posts which seem to be close to my 
> problem. However, my experience in Lucene is limited to indexing 
> static html file repository, so I need some kind of tutorial.

You have a lot of options.  If you want to index the content directly 
from the JSPs, look at the demo that ships with Lucene.  You will 
likely have to implement a little bit of parsing to filter out 
taglibs/directives and get just the unadorned content.
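As a rough illustration of that filtering step, here is a minimal sketch that strips JSP comments, directives, scriptlets, and any remaining tags (HTML or taglib) using regular expressions. The class and method names are hypothetical, not part of Lucene or its demo. It does not decode HTML entities and will be confused by a ">" inside an attribute value, so treat it as a starting point rather than a real parser.

```java
import java.util.regex.Pattern;

public class JspTextExtractor {
    // JSP comments: <%-- ... --%> (removed first, since they may contain "%>")
    private static final Pattern JSP_COMMENT =
            Pattern.compile("<%--.*?--%>", Pattern.DOTALL);
    // Directives, scriptlets, expressions, declarations: <%@ %>, <% %>, <%= %>, <%! %>
    private static final Pattern JSP_BLOCK =
            Pattern.compile("<%.*?%>", Pattern.DOTALL);
    // Any remaining tag, whether HTML or a taglib tag such as <tiles:insert .../>
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the visible text of a JSP source string, with markup removed. */
    public static String extractText(String jsp) {
        String s = JSP_COMMENT.matcher(jsp).replaceAll(" ");
        s = JSP_BLOCK.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        // Collapse the runs of whitespace left behind by the removals
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String jsp = "<%@ taglib uri=\"/tags/struts-bean\" prefix=\"bean\" %>\n"
                   + "<h1>Welcome</h1>\n"
                   + "<p>Our <bean:message key=\"site.name\"/> catalog.</p>";
        System.out.println(extractText(jsp)); // prints "Welcome Our catalog."
    }
}
```

The returned string is what you would hand to Lucene as the body text of a document. Note that text produced at runtime by a taglib (the bean:message above) is lost this way, which is one argument for crawling the rendered site instead.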

Another option is to deploy your site and crawl it from the outside 
(have a look at Nutch on SourceForge, or write your own using 
HttpClient and some HTML parsing for hyperlinks).
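For the "HTML parsing for hyperlinks" part of a home-grown crawler, a minimal sketch might look like the following. It only covers link extraction from a page you have already fetched; the fetching itself (with HttpClient or anything else) is left out. The class and method names are hypothetical, and the regular expression is an assumption that handles only quoted href attributes on anchor tags, so a proper HTML parser would be more robust.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Captures the quoted href value of an <a> tag, dropping any #fragment
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"'#]+)[^\"']*[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the distinct absolute http(s) links found in html, resolved against baseUrl. */
    public static List<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Resolve relative links such as "news/today.jsp" against the page URL
                String resolved = base.resolve(m.group(1).trim()).toString();
                // Skip mailto:, javascript: and the like; skip links already seen
                if (resolved.startsWith("http") && !links.contains(resolved)) {
                    links.add(resolved);
                }
            } catch (IllegalArgumentException e) {
                // Malformed href (for example, one containing spaces): ignore it
            }
        }
        return links;
    }
}
```

A crawler would keep a queue of URLs to visit, fetch each page, pass the response body through extractLinks, and enqueue any result on the same host that it has not visited yet. The text of each fetched page then goes to Lucene for indexing.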

Lucene is not restricted in what it can index.  If you can get text out 
of it, Lucene can index it.

I would argue that content within the JSP is a bad thing given that you 
want to index it - perhaps it makes more sense to put the content 
somewhere easier to get at like a database?

