lucene-java-user mailing list archives

Subject Re: demo HTML parser question
Date Thu, 23 Sep 2004 16:10:14 GMT
Hi Fred,

We originally tried to use the demo HTML parser (Lucene 1.2), but as
you know, it's only a demo.  I think it's threaded to save time: the
calling thread can grab the title or top message even though the parser
is not done parsing the entire HTML document.  That's just a guess; I would
love to hear from others about this.  Anyway, since the parser runs in a
separate thread, a token error could kill that thread, and the calling
thread has no way to know about it.
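
To illustrate the point about a failure in a separate thread going unnoticed, here is a minimal sketch of one way a caller can learn about it. This is not how the Lucene demo parser works; the class name BackgroundParseDemo, the method parseInBackground, and the "token error" message are all made up for illustration. It assumes Java 8 or later (lambdas and java.util.concurrent). The worker is marked as a daemon thread so that it cannot keep the JVM alive by itself, which may also be relevant to the hang described in the quoted message, though that is an assumption.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BackgroundParseDemo {

    /**
     * Runs a (hypothetical) parse task on a background thread and hands
     * either its result or its failure back to the calling thread.
     */
    static String parseInBackground(Callable<String> parseTask) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "html-parser");
            worker.setDaemon(true); // a daemon thread does not keep the JVM alive
            return worker;
        });
        try {
            Future<String> result = executor.submit(parseTask);
            // get() blocks until the task ends; if the task threw, the
            // failure arrives here wrapped in an ExecutionException.
            return result.get();
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause; // surface the worker's own exception
            }
            throw e;
        } finally {
            executor.shutdownNow(); // stop the worker thread in every case
        }
    }

    public static void main(String[] args) {
        try {
            // Stand-in for a parser that dies on a bad token.
            parseInBackground(() -> {
                throw new IllegalStateException("token error");
            });
        } catch (Exception e) {
            System.err.println("Parse failed: " + e.getMessage());
        }
    }
}
```

The key design choice is that the caller waits on a Future instead of reading from a stream the worker fills in, so a dead worker shows up as an exception in the calling thread.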

We had to write our own HTML parser, since we only cared about grabbing the
entire text from the HTML document and we wanted to avoid the extra
thread.  We also do a lot of "SKIP"ping to tolerate minor EOF errors (HTML
documents in email almost never follow standards).  For your HTML needs, you
might want to check out the other JavaCC HTML parsers on the JavaCC web site.
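
As a rough illustration of the single-threaded, text-only approach, here is a minimal sketch. It is not our parser and it is not JavaCC-generated; it is a hand-written character loop, and the names LenientTextExtractor and extractText are invented for the example. It assumes the only goal is the text outside of tags, and it ignores entities, comments, and script or style content.

```java
final class LenientTextExtractor {

    /**
     * Returns the text found outside of tags. A tag that is still open
     * when the input ends is dropped instead of being reported as an error.
     */
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (insideTag) {
                if (c == '>') {
                    insideTag = false; // end of the current tag
                }
            } else if (c == '<') {
                insideTag = true;
                text.append(' '); // keep words on either side of a tag apart
            } else {
                text.append(c);
            }
        }
        // Collapse runs of whitespace left behind by the removed tags.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // The input ends in the middle of a tag, as truncated email HTML might.
        System.out.println(extractText("<html><body><p>Hello <b>world</b></p><div class="));
    }
}
```

Because everything runs in the calling thread, any failure surfaces as an ordinary exception at the call site. One trade-off of replacing each tag with a space is that inline markup in the middle of a word splits that word in two.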


On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote
> Hi,
> I've been working with the HTML parser demo that comes with
> Lucene and I'm trying to understand why it's multi-threaded,
> and, more importantly, how to exit gracefully on errors.
> I've discovered if I throw an exception in the front-end static
> code (main(), etc.), the JVM hangs instead of exiting. Presumably
> this is because there are threads hanging around doing something.
> But I'm not sure what!
> Any pointers? I just want to exit gracefully on an error such as
> a required meta tag is missing or similar.
> Thanks,
> Fred
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
