lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: HTML saga continues...
Date Thu, 12 Dec 2002 21:13:16 GMT
Yeah, Neko is not the most straightforward, but it works.
Sorry, I have the code somewhere... can't look for it now.
But you could also look at LARM in the Lucene Sandbox; it's got a nice
HTML parser, too.
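
For reference, option (2) from the quoted message below, Swing's HTML
parser, needs nothing beyond the JDK. The following is only a minimal
sketch of pulling plain text out of a page with it, not the code used in
the tests discussed here. The class name SwingHtmlText and the method
extractText are made up for illustration; the returned string is what
you would then hand to whatever Lucene field you index.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: the class and method names are illustrative only.
public class SwingHtmlText {

    /** Collects the text chunks reported by Swing's HTML parser, separated by single spaces. */
    public static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of character data between tags.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        // Third argument true: ignore any charset declared inside the document,
        // since the Reader has already decoded the bytes.
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello world</p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

Creating a new ParserDelegator per document keeps the sketch simple; for
a large collection you would want to measure whether that matters.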

Otis

--- Leo Galambos <galambos@com-os2.ms.mff.cuni.cz> wrote:
> So, I have tried this with Lucene:
> 1) the original JavaCC LL(k) HTML parser
> 2) Swing's HTML parser
>
> In case (1) I could process about 300K of HTML documents; in case (2),
> more than 400K.
>
> But I cannot process the complete collection (5M) and finish my hard
> stress tests of Lucene.
>
> Is there anyone who has an HTML parser that really works with Lucene? :)
> If you think that you have one, please let me know. I wanted to try
> Neko, but it looks complicated and I do not want to affect the results
> with a ``robust'' parser.
> 
> THX
> 
> -g-
> 
> 
> --
> To unsubscribe, e-mail:  
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
> 


