lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: HTML Parsing problems...
Date Sat, 20 Sep 2003 00:08:00 GMT
I'm going to swap in the neko HTML parser for the demo refactorings I'm  
doing.  I would be all for replacing the demo HTML parser with this.

If you look at the Ant <index> task in the sandbox, you'll see that I  
used JTidy for it and it works well, but I've heard that neko is faster  
and better so I'll give it a try.

	Erik


On Thursday, September 18, 2003, at 04:50  PM, Michael Giles wrote:

> I know, I know, the HTML Parser in the demo is just that (i.e. a  
> demo), but I also know that it is updated from time to time and  
> performs much better than the other ones that I have tested.   
> Frustratingly, the very first page I tried to parse failed  
> (<http://www.theregister.co.uk/content/54/32593.html>http:// 
> www.theregister.co.uk/content/54/32593.html). It seems to be choking  
> on tags that are being written inside of JavaScript code (i.e.  
> document.write('</scr' + 'ipt>');.  Obviously, the simple solution  
> (that I am using with another parser) is to just ignore everything  
> inside of <script> tags.  It appears that the parser is ignoring text  
> inside script tags, but it seems like it needs to be a bit smarter (or  
> maybe dumber) about how it deals with this (so it doesn't get confused  
> by such occurrences).  I see a bug has been filed regarding trouble  
> parsing JavaScript, has anyone given it thought?
>
> Outside of the HTML parsing, all is well (and outside of a few pages,  
> the parser is a champ).
>
> Thanks!
> -Mike 
>  


Mime
View raw message