lucene-java-user mailing list archives

From Leo Galambos
Subject HTML saga continues...
Date Thu, 12 Dec 2002 19:12:53 GMT
So, I have tried the following with Lucene:
1) the original JavaCC LL(k) HTML parser
2) Swing's HTML parser

In case (1) I could process about 300K HTML documents; in case (2), more
than 400K.

But I cannot process the complete collection (5M), so I cannot finish my
hard stress tests of Lucene.

Is there anyone who has an HTML parser that really works with Lucene? :) If
you think that you have one, please let me know. I wanted to try Neko, but 
it looks complicated and I do not want to affect the results by ``robust'' 


