lucene-java-user mailing list archives

From Erik Hatcher <li...@ehatchersolutions.com>
Subject Re: HTML saga continues...
Date Thu, 12 Dec 2002 19:59:19 GMT
On a related note, I've also released a project that I developed for my 
book and for presentations that I have been giving on Ant, XDoclet, and 
JUnit.  This project is a documentation search engine with a web 
(Struts) interface.  It uses Lucene and the Ant task I mentioned already 
to index a directory full of HTML and text files.  The sample data 
provided is Ant's documentation.

It's available as version 0.3 (the current release, but always grab the 
latest that's there) at http://www.ehatchersolutions.com/downloads/

I have not documented it well yet, but that is my plan over the next 
couple of weeks.

To get it running you need:

- Ant 1.5.1 (1.5 is not sufficient)
- JUnit 3.8 or up (3.8.1 is the latest)
- j2ee.jar - I don't provide this in the download for size (and legal?) 
reasons.

Build it this way:

	ant -Dj2ee.jar=/path/to/my/j2ee.jar

Or if you run it without the -D switch it will tell you where to place 
j2ee.jar by default.  If you have J2EE_HOME set it will pick that up 
automatically and use it appropriately.
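
As a rough sketch of how that lookup could be mirrored from the command 
line, the shell functions below pick a j2ee.jar location and print the 
matching Ant invocation.  The function names (j2ee_jar_path, ant_command), 
the precedence (explicit path first, then J2EE_HOME), and the 
$J2EE_HOME/lib/j2ee.jar layout are assumptions for illustration, not 
taken from the actual build file.

```shell
#!/bin/sh
# Hypothetical helper: decide which j2ee.jar to hand to the build.
# Assumed precedence: explicit argument, then $J2EE_HOME/lib/j2ee.jar.
j2ee_jar_path() {
    if [ -n "${1:-}" ]; then
        # An explicit path wins, mirroring -Dj2ee.jar=... on the command line.
        printf '%s\n' "$1"
    elif [ -n "${J2EE_HOME:-}" ]; then
        # Assumed layout of a J2EE SDK install.
        printf '%s\n' "$J2EE_HOME/lib/j2ee.jar"
    else
        # Nothing to go on; the build itself reports its default location.
        return 1
    fi
}

# Hypothetical helper: print (not run) the Ant command for the resolved jar.
ant_command() {
    if jar=$(j2ee_jar_path "${1:-}"); then
        printf 'ant -Dj2ee.jar=%s\n' "$jar"
    else
        # No jar found: a plain "ant" lets the build say where it expects one.
        printf 'ant\n'
    fi
}
```

Calling ant_command with no argument and no J2EE_HOME prints a bare 
"ant", which is the case where the build reports its default j2ee.jar 
location.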

Deploy the WAR in a web container, or the EAR in JBoss.  Navigate to:

	http://localhost:8080/ant-sample/

and search for your favorite Ant tasks or Ant-related information.

Let me know if you experience any issues with it, or have comments.

	Erik

Erik Hatcher wrote:
> Look in the Lucene sandbox in CVS.  I contributed an Ant task that 
> indexed HTML documents.  It uses JTidy under the covers to parse HTML 
> into title and body content, and it could be extended to pull other 
> information such as <meta> keywords.
> 
>     Erik
> 
> 
> Leo Galambos wrote:
> 
>> So, I have tried this with Lucene:
>> 1) original JavaCC LL(k) HTML parser
>> 2) SWING's HTML parser
>>
>> In case (1) I could process about 300K of HTML documents. In case 
>> (2), more than 400K.
>>
>> But I cannot process the complete collection (5M) and finish my hard
>> stress tests of Lucene.
>>
>> Is there anyone who has an HTML parser that really works with Lucene? :)
>> If you think that you have one, please let me know. I wanted to try Neko, 
>> but it looks complicated and I do not want to affect the results by 
>> a ``robust'' parser.
>>
>> THX
>>
>> -g-
>>
>>
>> --
>> To unsubscribe, e-mail:   
>> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
>> For additional commands, e-mail: 
>> <mailto:lucene-user-help@jakarta.apache.org>
>>
>>
>>
> 
> 

