lucene-java-user mailing list archives

From Erik Hatcher <li...@ehatchersolutions.com>
Subject Re: HTML saga continues...
Date Thu, 12 Dec 2002 19:21:31 GMT
Look in the Lucene sandbox in CVS.  I contributed an Ant task that 
indexes HTML documents.  It uses JTidy under the covers to parse HTML 
into title and body content, and it could be extended to pull other 
information such as <meta> keywords.
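
To illustrate the title/body split described above, here is a minimal 
sketch. It is not the Ant task's code and it does not use JTidy: as a 
stand-in it relies on the HTML parser that ships with the JDK (Swing's), 
so it runs without extra jars. The class name HtmlFields and its fields 
are hypothetical, and handing the two strings to Lucene as document 
fields is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: splits an HTML page into title text and body text.
// Assumption: uses the JDK's Swing parser as a stand-in for JTidy.
public class HtmlFields {
    public final String title;
    public final String body;

    private HtmlFields(String title, String body) {
        this.title = title;
        this.body = body;
    }

    public static HtmlFields parse(Reader html) throws IOException {
        final StringBuilder title = new StringBuilder();
        final StringBuilder body = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text inside <title> goes to the title; everything else to the body.
                StringBuilder target = inTitle ? title : body;
                if (target.length() > 0) {
                    target.append(' ');
                }
                target.append(data);
            }
        };

        // The third argument (ignoreCharSet = true) keeps a <meta> charset
        // declaration from aborting the parse.
        new ParserDelegator().parse(html, callback, true);
        return new HtmlFields(title.toString().trim(), body.toString().trim());
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><head><title>Test Page</title></head>"
                + "<body><p>First paragraph.</p></body></html>";
        HtmlFields fields = parse(new StringReader(page));
        System.out.println("title: " + fields.title);
        System.out.println("body:  " + fields.body);
    }
}
```

Pulling <meta> keywords would probably mean overriding handleSimpleTag 
in the same callback and reading the tag's attributes, but I have not 
shown that here.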

	Erik


Leo Galambos wrote:
> So, I have tried this with Lucene:
> 1) original JavaCC LL(k) HTML parser
> 2) SWING's HTML parser
> 
> In case (1) I could process about 300K HTML documents. In case (2), 
> more than 400K.
> 
> But I cannot process the complete collection (5M) and finish my hard
> stress tests of Lucene.
> 
> Is there anyone who has an HTML parser that really works with Lucene? :)
> If you think that you have one, please let me know. I wanted to try Neko,
> but it looks complicated and I do not want to affect the results by using
> a ``robust'' parser.
> 
> THX
> 
> -g-
> 
> 
> --
> To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>