lucene-java-user mailing list archives

From roy-lucene-u...@xemaps.com
Subject Re: demo HTML parser question
Date Thu, 23 Sep 2004 19:48:56 GMT
On Thu, 23 Sep 2004 10:53:26 -0700, Doug Cutting wrote
> roy-lucene-user@xemaps.com wrote:
> > We were originally attempting to use the demo HTML parser (Lucene 1.2), but as
> > you know, it's only a demo.  I think it's threaded to save time, allowing
> > the calling thread to grab the title or top message even though it's not done
> > parsing the entire HTML document.
> 
> That's almost right.  I originally wrote it that way to avoid having 
> to ever buffer the entire text of the document.  The document is 
> indexed while it is parsed.  But, as observed, this has lots of 
> problems and was probably a bad idea.
> 
> Could someone provide a patch that removes the multi-threading?  
> We'd simply use a StringBuffer in HTMLParser.jj to collect the text. 
>  Calls to pipeOut.write() would be replaced with text.append().  
> Then have the HTMLParser's constructor parse the page before 
> returning, rather than spawn a thread, and getReader() would return 
> a StringReader.  The public API of HTMLParser need not change at all 
> and lots of complex threading code would be thrown away.  Anyone 
> interested in coding this?
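
For reference, here is a rough sketch of the shape described above. It is not the demo's HTMLParser and not a patch against HTMLParser.jj: the class name BufferedHtmlText and the getText() helper are made up for illustration, and the naive tag-skipping loop only stands in for the JavaCC grammar. The point is just that the constructor does all the parsing up front, text is appended to a StringBuffer, and getReader() hands back a StringReader.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for the JavaCC-generated HTMLParser; not the Lucene demo class.
class BufferedHtmlText {
    // Collects the extracted text; takes the place of the pipe fed by a parser thread.
    private final StringBuffer text = new StringBuffer();

    // Parses the whole input before returning; no thread is spawned.
    BufferedHtmlText(Reader in) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                inTag = true;            // naive: everything up to '>' is markup
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                text.append((char) c);   // where pipeOut.write() would have been
            }
        }
    }

    // Same accessor name as in the proposal, now backed by the buffer.
    Reader getReader() {
        return new StringReader(text.toString());
    }

    // Illustration-only convenience for inspecting the collected text.
    String getText() {
        return text.toString();
    }
}
```

The trade-off is the one already noted: the whole text of the document is held in memory, in exchange for dropping the threading code.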

While we're on the subject...

When using the HTMLParser I get a lot of token manager errors (usually an
unexpected EOF) that kill the parser thread.  Even if we removed the
multi-threading from the HTMLParser, these failures would kill the calling
app, because they are thrown as an Error rather than an Exception.  Any idea
how to get around this?
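
The only workaround I can think of is to catch the Error at the call site and rethrow it as a checked exception. A rough sketch follows, with assumptions flagged: the TokenMgrError class here is a local stand-in for the one JavaCC generates (declared only so the example compiles on its own), and SafeParse, parse() and parseOrThrow() are hypothetical names, not part of Lucene or JavaCC.

```java
import java.io.IOException;

// Stand-in for the JavaCC-generated class, which also extends Error.
class TokenMgrError extends Error {
    TokenMgrError(String message) {
        super(message);
    }
}

class SafeParse {
    // Hypothetical parse step that fails the way the token manager does
    // when the input ends in the middle of a tag.
    static String parse(String html) {
        if (html.endsWith("<")) {
            throw new TokenMgrError("Lexical error: encountered <EOF>");
        }
        return html;
    }

    // Converts the Error into a checked exception so that one bad document
    // is reported to the caller instead of propagating as an Error.
    static String parseOrThrow(String html) throws IOException {
        try {
            return parse(html);
        } catch (TokenMgrError e) {
            throw new IOException("Could not tokenize document: " + e.getMessage());
        }
    }
}
```

Catching only TokenMgrError, rather than Error in general, leaves things like OutOfMemoryError alone.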

Perhaps this question really belongs on the javacc list?

Roy.
