lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Re: HTML saga continues...
Date Thu, 12 Dec 2002 19:59:19 GMT
On a related note, I've also released a project that I developed for my 
book and for presentations that I have been giving on Ant, XDoclet, and 
JUnit.  This project is a documentation search engine with a web 
(Struts) interface.  It uses Lucene and the Ant task I mentioned already 
to index a directory full of HTML and text files.  The sample data 
provided is Ant's documentation.
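
As a rough illustration of the "directory full of HTML and text files" step, here is a minimal sketch of how candidate files might be collected before being handed to an indexer. It is not the Ant task's actual logic: the class name DocFileCollector, the method names, and the choice of .html, .htm and .txt extensions are all assumptions made for this example, and it uses only the JDK (no Lucene calls).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: gathers the HTML and text files under a directory.
public class DocFileCollector {

    // Assumption: a file is indexable if its name ends in .html, .htm or .txt
    // (compared case-insensitively).
    public static boolean isIndexable(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".html") || name.endsWith(".htm") || name.endsWith(".txt");
    }

    // Walks the tree below root and returns the matching regular files, sorted.
    public static List<Path> collect(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(DocFileCollector::isIndexable)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no argument is given.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : collect(root)) {
            System.out.println(p);
        }
    }
}
```

Each returned path would then be parsed and added to the index; that part depends on the Lucene version in use and is left out here.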

It's available as version 0.3 (currently, but always grab the latest 
that's there) at

I have not documented it well yet, but that is my plan over the next 
couple of weeks.

To get it running you need:

- Ant 1.5.1 (1.5 is not sufficient)
- JUnit 3.8 or up (3.8.1 is the latest)
- j2ee.jar - I don't provide this in the download for size (and legal?) 
  reasons

Build it this way:

	ant -Dj2ee.jar=/path/to/my/j2ee.jar

Or if you run it without the -D switch it will tell you where to place 
j2ee.jar by default.  If you have J2EE_HOME set it will pick that up 
automatically and use it appropriately.
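
To make the lookup order concrete, here is a small sketch of how such a resolution could work. It is only an illustration, not the project's build logic: the precedence shown (explicit -D value first, then J2EE_HOME, then a default), the lib/j2ee.jar layout under J2EE_HOME, the default path, and the class name J2eeJarLocator are all assumptions.

```java
import java.io.File;

// Hypothetical helper mirroring the lookup described above.
public class J2eeJarLocator {

    // Returns the first usable location, in this assumed order:
    //   1. an explicit value (what -Dj2ee.jar=... would supply)
    //   2. <J2EE_HOME>/lib/j2ee.jar (assumed layout)
    //   3. the supplied default path
    public static String resolve(String propertyValue, String j2eeHome, String defaultPath) {
        if (propertyValue != null && !propertyValue.isEmpty()) {
            return propertyValue;
        }
        if (j2eeHome != null && !j2eeHome.isEmpty()) {
            return new File(new File(j2eeHome, "lib"), "j2ee.jar").getPath();
        }
        return defaultPath;
    }

    public static void main(String[] args) {
        String path = resolve(System.getProperty("j2ee.jar"),
                              System.getenv("J2EE_HOME"),
                              "lib/j2ee.jar"); // hypothetical default location
        System.out.println("Using j2ee.jar at: " + path);
    }
}
```

Running it with java -Dj2ee.jar=/path/to/my/j2ee.jar J2eeJarLocator prints the explicit path; without the switch it falls back to J2EE_HOME if that is set.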

Deploy the WAR in a web container, or the EAR in JBoss.  Navigate to:


and search for your favorite Ant tasks or Ant-related information.

Let me know if you experience any issues with it, or have comments.


Erik Hatcher wrote:
> Look in the Lucene sandbox in CVS.  I contributed an Ant task that 
> indexed HTML documents.  It uses JTidy under the covers to parse HTML 
> into title and body content, and it could be extended to pull other 
> information such as <meta> keywords.
>     Erik
> Leo Galambos wrote:
>> So, I have tried this with Lucene:
>> 1) original JavaCC LL(k) HTML parser
>> 2) SWING's HTML parser
>> In case of (1) I could process about 300K of HTML documents. In case 
>> of (2) more than 400K.
>> But I cannot process the complete collection (5M) and finish my hard stress
>> tests of Lucene.
>> Is there anyone who has an HTML parser that really works with Lucene? :) If
>> you think that you have one, please let me know. I wanted to try Neko, 
>> but it looks complicated and I do not want to affect the results by 
>> a ``robust'' parser.
>> THX
>> -g-

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>
